I  INTRODUCTION


A. Background

This project was established to study the relevance of linear predictive
coding (LPC) estimation techniques for the development of a practical,
real-time system for transmitting digitized voice signals. These
techniques had been shown to provide excellent quality transmission at
modest bit rates when simulated on large-scale digital computers. Our
goal was to determine how they could be used in packet communication
systems with smaller computers.

During the first year of the project, our perspective on the problem
changed in three ways. First, it became apparent that achieving high
quality was of paramount importance and that the computational load was
not as critical as originally anticipated. The rapid development of
high-speed large-scale integrated (LSI) circuits has made it possible to
achieve remarkable computational capabilities today, and the projections
for future developments are even more promising. In addition, most of the
LPC approaches offer roughly comparable computational loads, since the
major amount of computation is in the calculation of autocorrelation
coefficients and in the synthesizing filter. Thus, we decreased our
emphasis on the number of computations per second.

Second, as the program progressed, more literature on quantization
accuracy requirements became available. We were able to adopt the major
results and the most promising techniques from this research and,
accordingly, to reduce our own efforts in this area.

Third, the importance of accurate pitch for high quality synthesis
and the difficulty of the pitch-extraction problem became apparent early
in the program. The high quality of the original LPC-synthesized speech
resulted from the accuracy of hand-marked pitch pulses as well as the
inherent advantages of the LPC technique itself. Furthermore, pitch
extraction from the residual was far more complex than the original
papers implied. As a result, work on pitch extraction was established
as Task 3 research under this contract, and major effort was directed
toward the study of the excitation function.

B. Summary of Areas Studied During Task 2 Research

1. Asynchronous Operation

This research was directed toward the development of an LPC
speech digitization technique that is compatible with the asynchronous
operational mode of packet communication systems. Since previous
research on LPC techniques had been concerned exclusively with synchronous
systems, a major part of our effort was devoted to the study of
asynchronous operation. The result is the adaptive data compression
algorithm DELCO, described in detail in Magill (1973), a copy of which is
attached as Appendix A to this report.¹* This algorithm is specifically
designed to function with and take advantage of the characteristics of an
asynchronous data channel. DELCO offers a data compression factor between
2:1 and 3:1 beyond that achieved by standard LPC approaches, with no
degradation in voice quality.† Thus, neglecting the overhead of the
packet communication system, we can transmit speech digitally at between
1200 and 2400 baud.


* References are listed at the end of this report.

† An additional 2:1 data compression is achieved in an asynchronous
system, since no channel capacity is allocated for listening as in
fixed-channel assignment systems. That is, it is possible to capitalize on
the less than 50 percent average duty factor in a two-way conversation.

This excellent performance is obtained by two means. First,
the pauses in speech are recognized by a TASI-type speech detector and
are not encoded or transmitted. Second, periodic waveforms, such as
occur in steady-state vowels, are recognized; LPC coefficients are
transmitted only when new values are required, i.e., when the vocal tract
configuration changes significantly. The need for coefficient updating
is determined from the ratio of the residual energies formed with the
previous LPC parameters and the optimum parameters. Note that these two
operations do not significantly increase the number of computations, so
the ability to achieve real-time operation is not significantly impaired.
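
The update test based on the ratio of residual energies can be illustrated
with a minimal Python sketch. It is not the DELCO algorithm of Appendix A:
the function names, the threshold of 1.5, the 8-kHz sampling rate, and the
noise-driven resonator used as a test signal are all assumptions made only
for illustration.

```python
import numpy as np

def autocorr_lpc(frame, order):
    """Autocorrelation-method LPC; returns predictor coefficients a[0..order-1]."""
    n = len(frame)
    r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:])

def residual_energy(frame, a):
    """Energy of e(n) = s(n) - sum_k a[k-1] * s(n-k) over samples with full history."""
    p = len(a)
    e = frame[p:].copy()
    for k in range(1, p + 1):
        e -= a[k - 1] * frame[p - k:len(frame) - k]
    return float(np.dot(e, e))

def needs_update(frame, a_prev, threshold=1.5):
    """Return (update?, optimum coefficients, energy ratio) for one frame.

    The threshold is a hypothetical value, not one taken from the report.
    """
    a_opt = autocorr_lpc(frame, len(a_prev))
    ratio = residual_energy(frame, a_prev) / residual_energy(frame, a_opt)
    return ratio > threshold, a_opt, ratio

def resonator_noise(freq_hz, n, fs=8000.0, r=0.95, seed=0):
    """Hypothetical test signal: white noise through a two-pole resonator."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(n)
    a1, a2 = 2 * r * np.cos(2 * np.pi * freq_hz / fs), -r * r
    s = np.zeros(n)
    for i in range(n):
        s[i] = w[i] + a1 * (s[i - 1] if i >= 1 else 0.0) + a2 * (s[i - 2] if i >= 2 else 0.0)
    return s

frame_a = resonator_noise(500.0, 240)            # "steady" vocal tract
frame_b = resonator_noise(1500.0, 240, seed=1)   # changed configuration
a_prev = autocorr_lpc(frame_a, 2)
same_update, _, same_ratio = needs_update(frame_a, a_prev)
changed_update, _, changed_ratio = needs_update(frame_b, a_prev)
print(same_update, changed_update)
```

When the previous coefficients still fit the frame, the ratio stays near
one and nothing need be transmitted; when the resonance moves, the old
coefficients leave far more residual energy than the optimum ones and the
test requests an update.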

2.  Error  Signal  Characterization 

In previous studies, two methods have been used to characterize
the error (or residual) signal, i.e., the difference between the predicted
and actual values. In the first method, the error signal is characterized
at each time sample by several bits. The quantized error signal
is transmitted and used to drive the synthesizer at the receiver. A
potential advantage of this approach is that the synthesis procedure
should maintain high quality performance even in the presence of audio
background noise. The major disadvantage is that the bit rate required
to characterize the error signal is high, e.g., nominally at least 7200
baud for a one-bit quantizer.

With the second method, the error signal's features are
extracted, so that a much lower bit rate is adequate to represent the error
signal. These key features are the voiced/unvoiced (V/UV) decision, pitch
frequency, and power level. The disadvantage of this method is that,
if errors are made in the feature-extraction process, serious degradation
of performance will result. Unfortunately, these errors can occur rather
easily in the presence of common disturbances, such as audio background
noise, phone-line signal distortion, and multiple speakers.

Because of the difficulties in these methods, a major goal in
our research was to seek alternative encoding or characterization
techniques. Several concepts based on peak-picking, threshold-crossing, and
extrema-encoding were proposed; however, a detailed investigation of
these techniques was not possible because of the character of the error
signal revealed by experimental observations. The proposed algorithms
simply would not function reliably with the observed signals.

This result was not anticipated because some of the foremost
researchers in LPC methods had indicated that simple peak-picking was
adequate (Atal and Hanauer, 1971). Our experiments, however, showed
conclusively that one could not rely on the presence of a readily
observable pitch pulse in the residual signal. In fact, the residual
frequently was highly oscillatory with multiple peaks per pitch period.
This situation destroyed the purpose and the advantages of the proposed
algorithms for error-signal characterization.

The difficulty of the encoding problem can best be appreciated
by noting that the residual signal is extremely intelligible. In fact,
it sounds like differentiated speech. Thus, the problem of encoding the
residual signal is virtually as complex as the problem of directly
encoding the input speech.

Since this result was so surprising, we made an attempt to
determine the cause. First, we tried various forms of analyses (such as
pitch-synchronous versus pitch-asynchronous, and Toeplitz versus
non-Toeplitz) and varying numbers of coefficients. The most desirable
residual signals were found with a pitch-synchronous (over one pitch
period), non-Toeplitz analysis or with a preemphasized, Toeplitz analysis
over multiple pitch periods. Nevertheless, even in these cases, highly
oscillatory residuals were frequently observed. Thus, the proposed
algorithms would not function well enough for any of the conventional LPC
approaches.

After a literature search and review and after experiments
with synthetic speech, we discovered two potential difficulties. First,
the use of a stationary model (fixed predictive coefficients for each
analysis block) increases the energy of the error signal, especially
during speech sounds with changing formant frequencies (vowel glides,
transitions from consonant to vowel, and the like). Hence the choice of
analysis block size is critical. Second, conventional LPC approaches
model the glottal excitation shape as well as the vocal tract. However,
the glottal excitation waveshape cannot be modeled accurately by poles
(although the spectrum can be approximated quite well). As a result, the
residual signal based on these approximate LPC parameters was frequently
quite oscillatory and contained a significant amount of formant information.

On the basis of a theoretical model and experiments with
synthetic speech, we determined that the true vocal tract parameters could
be found by performing an LPC analysis over only the force-free portion
of one pitch period. Use of these true vocal tract parameters in the
predictor produces the glottal excitation waveshape for the residual
signal. This waveshape lends itself naturally to the proposed encoding
schemes of peak-picking, threshold-crossing, and extrema-encoding.
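
A small synthetic-speech experiment of this kind can be sketched in Python
as follows. The sketch is only an illustration under stated assumptions:
the two-resonance vocal tract, the triangular glottal pulse, the 8-kHz
rate, the 80-sample period with a 30-sample open phase, and all function
names are hypothetical and are not taken from the report. A least-squares
(non-Toeplitz) analysis restricted to the samples where the excitation is
zero recovers the filter used for synthesis, whereas the same analysis over
the whole period does not.

```python
import numpy as np

def all_pole_synthesis(a, excitation):
    """s(n) = x(n) + sum_k a[k-1] * s(n-k), with zero initial conditions."""
    s = np.zeros(len(excitation))
    for n in range(len(excitation)):
        acc = excitation[n]
        for k in range(1, len(a) + 1):
            if n - k >= 0:
                acc += a[k - 1] * s[n - k]
        s[n] = acc
    return s

def covariance_lpc(s, start, stop, order):
    """Least-squares (non-Toeplitz) predictor fitted to s[start:stop]."""
    past = np.array([[s[n - k] for k in range(1, order + 1)]
                     for n in range(start, stop)])
    a, *_ = np.linalg.lstsq(past, s[start:stop], rcond=None)
    return a

fs, period, open_len = 8000.0, 80, 30   # assumed values

def section(freq_hz, r):
    """Second-order factor of the error filter for one resonance."""
    return np.array([1.0, -2 * r * np.cos(2 * np.pi * freq_hz / fs), r * r])

error_filter = np.convolve(section(500.0, 0.97), section(1500.0, 0.95))
a_true = -error_filter[1:]              # predictor form of the "vocal tract"

# Triangular glottal pulse during the open phase, zero (force-free) afterwards.
pulse = np.concatenate([np.linspace(0.0, 1.0, open_len // 2, endpoint=False),
                        np.linspace(1.0, 0.0, open_len // 2, endpoint=False)])
one_period = np.concatenate([pulse, np.zeros(period - open_len)])
speech = all_pole_synthesis(a_true, np.tile(one_period, 3))

last = 2 * period                       # start of the third pitch period
a_force_free = covariance_lpc(speech, last + open_len, last + period, 4)
a_whole = covariance_lpc(speech, last, last + period, 4)
print(np.round(a_true, 4), np.round(a_force_free, 4), np.round(a_whole, 4))
```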

Thus, the research indicated that the use of the proposed
concepts is possible. First, however, it is necessary to find the
force-free period for analysis. This problem is complex but fortunately is
not quite as demanding as pitch extraction. Because of the difficulty of
the pitch-encoding problem, it was assigned to a separate study of
excitation encoding (see the Task 3 report). Meanwhile, we adopted the
feature-extraction approach and placed the pitch pulses by hand. With this
approach, we avoided the problems of algorithmic pitch extraction and could
concentrate on the major problem of asynchronous operation.

As mentioned before, the error-signal characterization (or
pitch-extraction) problem is extremely difficult. A separate research
effort (Task 3) was devoted to this subject; the reader is referred to
the Task 3 final report for more details on error-signal characterization.
In this Task 2 report, sections are devoted to time-domain pitch
extraction (Section III) and pitch-accuracy requirements (Section IV).

3. Process Modeling

The requirement for zeros in the speech process model was
determined to result from the following factors:

• Incorrect  analysis  time  base  with  respect  to  the  pitch 
period,  i.e.,  nonminimum  phase  waveforms  during  the 
analysis  interval. 

• Glottal  excitation  waveshape. 

• Nasals. 

• Other  sounds  with  side  cavities  or  branches  in  the 
acoustic  tract. 

We determined that pole approximations to the zeros required
for the last two items gave adequate performance with respect to
synthesis, provided that the pitch-extraction problem was solved. That is,
the ear is relatively insensitive to the phase of the synthesized speech.
However, the inability to produce an inverse filter that correctly
deconvolves the source zeros greatly hampers pitch extraction based on the
residual signal. Thus, for nasals and for other sounds produced with
side cavities present, the need for zero modeling is principally
associated with the pitch-extraction problem.

To model the excitation waveshape accurately, many zeros
(perhaps 50) are required because of the high duty factor of the excitation.
The resulting computational problems can be avoided in several ways.
First, the residual can be heavily filtered and the sampling rate can be
reduced so that fewer zeros suffice. Second, the excitation waveshape
can be approximated by a simple waveform (e.g., a triangle) and the
characteristic parameters can simply be encoded so that the problem of
zeros is avoided altogether. Third, two LPC analyses can be performed.
The first analysis would be based on a selected period to avoid the
excitation function. These coefficients would be used to produce a
residual that permitted simple pitch extraction. The second LPC analysis
would be based on one or more pitch periods and would model the excitation
waveshape (by approximating it with poles) as well as the vocal tract
transfer function. Thus, this second set of LPC parameters could be used
in a synthesizer driven by an impulse function. In this case, no zero
modeling is required. Because of the variety in glottal waveshapes, we
recommend the use of the third approach.

The major need for zero modeling was determined to be for
phonemes in which the acoustic channel has a side branch (nasals included).
Here, the major goal of zero modeling is to produce a residual that
permits simple pitch extraction.

Preliminary efforts were directed toward methods of zero
determination. Methods based on solution of quadratic equations and
root-finding of a polynomial were found in the literature (Gersh and Luo,
1972; Hsia and Landgrebe, 1967). An adaptive gradient technique that avoids
the above complex operations was also found in the literature (Melsa et al.,
1973).⁵

We made no attempt to implement any of the zero-finding
algorithms in the Task 2 effort. The primary need for zero modeling
was determined to be for characterization of the excitation function.
As a result, further consideration of zero modeling was left for Task 3.

4.  Simplification  of  the  Gain  Calculation 

Adequate synthesized speech quality has been achieved by using
a synthesizer excitation power level equal to the residual power level.
Although this approach does not guarantee that the synthesizer output
power will match the input signal power perfectly, it does offer a
sufficiently good approximation. As a result, the computational load is
significantly reduced compared with the original estimation by Atal and
Hanauer (1971). Section V of this report presents more details on the
gain calculation and the excitation function.
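
The simplified gain rule can be sketched in Python as follows. This is a
minimal illustration and not the procedure of Section V: the function
names, the first-order predictor, the test signal, and the frame and period
lengths are assumptions. The excitation, whether a pulse train or noise, is
simply scaled so that its mean power equals the mean power of the residual.

```python
import numpy as np

def inverse_filter(s, a):
    """Residual e(n) = s(n) - sum_k a[k-1] * s(n-k), zero history before s[0]."""
    s = np.asarray(s, dtype=float)
    e = s.copy()
    for k in range(1, len(a) + 1):
        e[k:] -= a[k - 1] * s[:len(s) - k]
    return e

def pulse_excitation(n_samples, period, power):
    """Impulse train (one pulse per period) with mean power equal to `power`."""
    x = np.zeros(n_samples)
    x[::period] = np.sqrt(power * period)
    return x

def noise_excitation(n_samples, power, seed=0):
    """White noise rescaled so that its mean power equals `power`."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_samples)
    return x * np.sqrt(power / np.mean(x * x))

frame = np.sin(0.2 * np.arange(240))        # hypothetical input frame
a = np.array([0.9])                         # hypothetical predictor
residual_power = float(np.mean(inverse_filter(frame, a) ** 2))

voiced = pulse_excitation(240, 80, residual_power)
unvoiced = noise_excitation(240, residual_power)
print(residual_power, np.mean(voiced ** 2), np.mean(unvoiced ** 2))
```

Matching the excitation power to the residual power needs only one sum of
squares per frame, which is why it is cheaper than solving for a gain that
matches the output power exactly.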

5.  Comparison of Toeplitz Versus Non-Toeplitz Form Solutions

Both the Toeplitz form (Markel, 1972; Itakura and Saito, 1972)
and the non-Toeplitz form (Atal and Hanauer, 1971) have been
implemented on the PDP-10 computer. Each can be operated in a variety
of modes with a user-selected number of coefficients and block size.
Very good performance has been demonstrated with both forms. On the basis
of the testing to date, it appears that the Toeplitz form is preferable
because it is computationally simpler, particularly with respect to
stability determination. However, modest differences in complexity are
probably not significant for future systems in light of the great
capability of LSI. The non-Toeplitz form appears to produce somewhat more
desirable residual signals for pitch extraction; however, it does not
solve the pitch-extraction problem (see Task 3 report). The Toeplitz
approach is recommended for preliminary real-time demonstrations.
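
The remark about stability determination can be illustrated with a short
Python sketch of the standard Levinson recursion for the Toeplitz form.
This is not the PDP-10 implementation; the function names and the test
frame are assumptions. The recursion yields the reflection coefficients as
a by-product, and the synthesis filter is stable when every one of them has
magnitude less than one, so no separate root-finding step is needed.

```python
import numpy as np

def autocorrelation(frame, order):
    n = len(frame)
    return np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)])

def levinson(r, order):
    """Levinson recursion for the Toeplitz normal equations.

    Returns the error filter A(z) = 1 + a[1] z^-1 + ..., the reflection
    coefficients k[0..order-1], and the final prediction-error energy.
    """
    a = np.zeros(order + 1)
    a[0] = 1.0
    k = np.zeros(order)
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        ki = -acc / err
        prev = a.copy()
        for j in range(1, i):
            a[j] = prev[j] + ki * prev[i - j]
        a[i] = ki
        k[i - 1] = ki
        err *= 1.0 - ki * ki
    return a, k, err

def is_stable(k):
    """Stability test read directly from the reflection coefficients."""
    return bool(np.all(np.abs(k) < 1.0))

# Hypothetical test frame: two sinusoids plus a little noise, Hamming weighted.
rng = np.random.default_rng(0)
t = np.arange(200)
frame = (np.sin(0.3 * t) + 0.5 * np.sin(1.1 * t)
         + 0.1 * rng.standard_normal(200)) * np.hamming(200)
r = autocorrelation(frame, 4)
a, k, err = levinson(r, 4)
print(np.round(k, 4), is_stable(k))
```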

6.  Innovations  Representation 

We concluded that the innovations representation of a random
process offers a more generalized viewpoint that may provide useful
insight for some speech-processing problems.⁸ However, it is much more
important to model the physical process accurately (e.g., to include zeros
or to use the proper number of coefficients) than to develop sophisticated
statistical representations. Consequently, only a modest effort was
devoted to the innovations approach.

Specifically, the energy of the innovations process, i.e., the
residual, was determined to be an extremely useful measure of the quality
of the parameter estimation. In fact, this approach led to the DELCO
compression algorithm discussed in Section VI and Appendix A and briefly
described above. However, no other significant contribution to
computational load or data reduction was found by considering the
innovations representation.

C. Outline of the Report

The preceding section briefly summarized our research results by
study areas. The rest of this report presents the details of our research.
Quality considerations of pitch-synchronous analysis and synthesis are
discussed in Section II. Section III discusses time-domain pitch
extraction. Pitch-accuracy requirements are presented in Section IV. The
LPC synthesizer excitation function recommendations and results are
developed in Section V. The adaptive data compression system, DELCO, is
discussed in Section VI. Section VII presents our conclusions.


II  PITCH-SYNCHRONOUS  ANALYSIS  AND  SYNTHESIS  TECHNIQUES 


Conventional linear predictive coding algorithms (Atal and Hanauer
and Markel) have concentrated on methods that attempt to characterize
not only the vocal tract transfer function but the glottal source itself.
Thus, the synthesizing filter, when driven by a series of impulse
functions at the pitch rate, attempts to reproduce the short-term power
spectrum of the speech. Both the excitation spectrum and the vocal tract
power transfer functions are represented. This statement holds for both
the non-Toeplitz matrix (Atal and Hanauer) and the Toeplitz matrix (Markel)
solutions to the problem.


Makhoul has shown that the formulation of the Toeplitz-form matrix
equations tends to estimate the peaks of the spectral envelope with great
accuracy, while the nulls or dips are estimated less accurately.⁹ This
performance is well matched to human perception. Thus, it appears that
the conventional LPC analysis does what is desired. However, the above
result is derived on the assumption of white noise excitation under
steady-state circumstances so that it is meaningful to discuss power
spectra. In practice, only a short segment of speech, perhaps 30 ms at
most, is analyzed. Furthermore, most of the time, the excitation is not
white noise but rather is one or two pitch pulses, or possibly several for
a high-pitched speaker. Since the analysis is conventionally performed on
a pitch-asynchronous basis, different phasings or timings of the
excitation with respect to the analysis interval can occur. Thus, depending
on this timing, somewhat different estimated short-term power spectrum
envelopes may result from the analysis when, in fact, there is no change
in the power spectrum.

The best solution to this problem is to increase the analysis period
so that more excitation pulses are present. With a sufficient number of
pulses, the timing of the pulses with respect to the analysis window is
not crucial. Furthermore, the concept of power spectrum becomes more
meaningful. Unfortunately, this solution hurts the transient response
of the analysis system; i.e., it may not be possible to track rapid
transients in the speech spectra with the larger window.
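
The timing effect and the benefit of a longer window can be reproduced
with a small Python experiment on synthetic speech. The sketch below is an
illustration only: the single 700-Hz resonance, the 100-Hz pitch, the 8-kHz
rate, the window lengths, and the function names are assumptions. A
strictly periodic signal is analyzed by the Toeplitz (autocorrelation)
method at several phasings of the window relative to the pitch pulses, and
the spread of the estimated coefficients is measured for a two-period and a
ten-period window.

```python
import numpy as np

PERIOD = 80                                   # 100-Hz pitch at an assumed 8-kHz rate
R_POLE, THETA = 0.95, 2 * np.pi * 700.0 / 8000.0
A1, A2 = 2 * R_POLE * np.cos(THETA), -R_POLE ** 2

def voiced(n_samples):
    """Impulse train at the pitch rate driving a fixed two-pole resonator."""
    s = np.zeros(n_samples)
    for n in range(n_samples):
        x = 1.0 if n % PERIOD == 0 else 0.0
        s[n] = x + A1 * (s[n - 1] if n >= 1 else 0.0) + A2 * (s[n - 2] if n >= 2 else 0.0)
    return s

def autocorr_lpc(frame, order=2):
    n = len(frame)
    r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:])

def coefficient_spread(signal, window, start=10 * PERIOD):
    """Largest peak-to-peak change of any coefficient as the window phase shifts."""
    est = np.array([autocorr_lpc(signal[start + o:start + o + window])
                    for o in range(0, PERIOD, 5)])
    return float(np.max(est.max(axis=0) - est.min(axis=0)))

speech = voiced(30 * PERIOD)                  # the process itself never changes
spread_short = coefficient_spread(speech, 2 * PERIOD)
spread_long = coefficient_spread(speech, 10 * PERIOD)
print(spread_short, spread_long)
```

Although the underlying process is stationary, the short window gives
estimates that move with the phasing; the longer window reduces that
movement, at the cost of the slower transient response noted above.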

To avoid the sluggish time response of large window analyses and
yet avoid timing-induced distortion, SRI has extensively studied
pitch-synchronous analysis.* Nominally we used a rectangular window over
one pitch period for a Toeplitz-form LPC analysis to derive the LPC
coefficients. However, at a later point in our research, we employed a
larger Hamming window over three pitch periods. This resulted in an
overlapped analysis, since we performed a new analysis each pitch period.
A Hamming window was not used unless an overlapped analysis was employed.
Otherwise, low-value nulls caused by the window might have suppressed
important data, e.g., when the glottal pulse occurred during a window null.
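
The two window arrangements can be sketched in Python as follows. This is
not the EPOCH subroutine of Appendix B; the function names, the choice to
center the three-period window on the current period, and the pitch-mark
values are assumptions made for illustration.

```python
import numpy as np

def single_period_frames(signal, marks):
    """Rectangular window over exactly one pitch period (mark to next mark)."""
    return [(marks[i], signal[marks[i]:marks[i + 1]].copy())
            for i in range(len(marks) - 1)]

def overlapped_frames(signal, marks):
    """One Hamming-weighted frame per interior pitch period.

    Each frame spans the previous, current, and next period, so successive
    frames overlap by two periods.
    """
    frames = []
    for i in range(1, len(marks) - 2):
        start, stop = marks[i - 1], marks[i + 2]
        frames.append((marks[i], signal[start:stop] * np.hamming(stop - start)))
    return frames

marks = [0, 80, 160, 245, 330, 410]           # hypothetical pitch marks (samples)
signal = np.sin(2 * np.pi * np.arange(410) / 80.0)
frames1 = single_period_frames(signal, marks)
frames3 = overlapped_frames(signal, marks)
print([len(f) for _, f in frames1], [len(f) for _, f in frames3])
```

Because the window length follows the pitch marks, the frame size varies
from period to period, which is one of the disadvantages listed later in
this section.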

* The subroutine EPOCH, which sets up the analysis and synthesis from the
pitch marks, is described in Appendix B.

An advantage of pitch-synchronous analysis is that pitch-synchronous
synthesis can be used without the necessity of interpolation. There is
considerable debate among the speech community about the necessity for
pitch-synchronous synthesis. However, most agree that the synthetic
speech quality is not degraded by it. There is general agreement, too, that
if interpolation is required for pitch-synchronous synthesis, one must
be very careful about the interpolation technique. A poor interpolation
system may do more harm than good. The basic problem is that linear
interpolation of LPC parameters, or reflection coefficients, does not
correspond to linear interpolation of the power spectrum. The desired
result could be achieved by solving for the poles of the LPC polynomials
and linearly interpolating these poles. Unfortunately, this is a messy
computational procedure requiring something like a Newton-Raphson
root-finding technique. Note that Markel has obtained quite good synthetic
speech simply by linearly interpolating the reflection coefficients.¹⁰
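
Linear interpolation of reflection coefficients can be sketched in Python
as shown below. The coefficient values and function names are hypothetical.
The sketch relies on the standard result that an all-pole filter is stable
when every reflection coefficient has magnitude less than one; a linear
blend of two such sets keeps that property, so every interpolated filter
remains stable, even though the spectrum itself is not interpolated
linearly.

```python
import numpy as np

def step_up(k):
    """Reflection coefficients -> error filter A(z) = 1 + a[1] z^-1 + ..."""
    a = np.array([1.0])
    for ki in k:
        padded = np.append(a, 0.0)
        a = padded + ki * padded[::-1]
    return a

def interpolate_reflection(k_start, k_end, fraction):
    """Linear blend of two reflection-coefficient sets, fraction in [0, 1]."""
    return (1.0 - fraction) * np.asarray(k_start) + fraction * np.asarray(k_end)

def max_pole_radius(a):
    """Largest pole magnitude of 1/A(z); below 1.0 means a stable synthesizer."""
    return float(np.max(np.abs(np.roots(a))))

k_start = np.array([-0.9, 0.8, -0.3])         # hypothetical values at one frame
k_end = np.array([-0.5, 0.6, 0.2])            # hypothetical values at the next
radii = [max_pole_radius(step_up(interpolate_reflection(k_start, k_end, f)))
         for f in np.linspace(0.0, 1.0, 5)]
print(np.round(radii, 4))
```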

The  advantages  of  the  pitch-synchronous  analysis  approach  are  that: 


• No  interpolation  of  parameters  is  necessary. 

• The  calculated  LPC  parameters  will  remain  constant  when 
the  speech  process  is  stationary. 

The  disadvantages  of  pitch-synchronous  analysis  are: 

• Variable analysis window size, which causes algorithm
complexity.

• Asynchronous  rate  of  generating  LPC  parameters,  which 
results  in  an  asynchronous  data  rate. 

• Higher  transmission  rates  when  parameters  are  encoded  each 
period,  a problem  particularly  for  high-pitched  speech. 

• Additional analysis system complexity, since pitch marks
are necessary before a pitch-synchronous analysis can be
performed.

• Incompatibility with many popular pitch-extraction
techniques (e.g., autocorrelation) that provide relative,
as opposed to absolute, pitch marks.*

• Incompatibility  with  LPC  techniques  that  do  not  use  pitch 
extraction,  such  as  RELP  (see  Task  3 report). 


* With overlapped analyses, the use of a Hamming window makes the
significance of a window of precisely three pitch periods of dubious value.
However, there may be some value in having the window always in the same
relative position with respect to the glottal pulse.

We performed extensive pitch-synchronous analysis/synthesis
simulations and demonstrated that very high quality synthesis is achievable.
The output is virtually indistinguishable from the input speech. However,
this high quality was achieved at the price of the disadvantages listed
above. If these disadvantages are significant enough (as it now appears),
pitch-synchronous analysis will not be used for practical real-time
vocoders. Nevertheless, pitch-synchronous analysis/synthesis serves as a
useful standard of quality that other more practical systems should strive
to achieve. With good pitch extraction and excitation, the only quality
degradation is due to the assumptions of the LPC speech model itself,
e.g., no zeros appear in the model.

The concept of pitch-synchronous analysis/synthesis is critically
dependent on precise, absolute pitch-mark placement. Time-domain pitch
extraction is briefly described in Section III of this volume and is
described in considerably more detail in the volume devoted to Task 3.
The required accuracy of pitch-pulse placement is discussed in Section
IV of this volume.

An important point is that pitch marks are placed in unvoiced
intervals (during the aspiration after a stop release, for example).
Pitch marks, rather than periods, are stored within our computer
simulation. Thus, pitch is considered from a time-domain viewpoint (i.e.,
the excitation required to produce a given waveform), rather than from
the prosodic viewpoint of speech analysis systems.

The excitation system is generalized with respect to conventional
approaches to include a mixture of noise and pulses. Section V
discusses the excitation system in more detail.


III  TIME-DOMAIN  PITCH  EXTRACTION


Pitch extraction has always been a fundamental and difficult
research problem of speech analysis in general, and of vocoder design
and implementation in particular. Linear predictive vocoder techniques
have yielded significant improvement in vocal tract modeling and, hence,
have intensified the need for good pitch extraction. The first sentences
on the analog tape accompanying this report demonstrate the good quality
achieved by an LPC vocoder when the pitch extraction is done by a human
operator using a high resolution CRT interactive display. The sentences
were chosen to test a range of difficult speech sounds (such as nasals,
vowel glides, and semi-vowels) and are typical of general American
conversational speech.

Through numerous experiments performed with interactive hand marking
of pitch pulses, we have pinpointed several requirements that a high
quality pitch extractor should satisfy. First, pitch marks are desirable
for some aperiodic speech signals. Good examples of these transients are
(1) stop releases, (2) the first voiced segment in a consonant-vowel
transition, and (3) utterance-terminal voiced signals with low amplitude
and vocal fry, i.e., erratic pitch. Second, during most significant
portions of speech, the pitch estimates should vary smoothly. Based on
experience with our data base, the acceptable rms pitch deviation from
the true pitch is approximately ±2 Hz.* "True" pitch is defined by the
hand-marked pitch pulses that produce synthetic speech virtually
indistinguishable from the original.


* The lowest pitch of our data base is approximately 100 Hz. If the data
base were expanded to include a speaker with a 50-Hz pitch, the required
accuracy is expected to be ±1 Hz.


Computing the average period over a large window, e.g., by the
autocorrelation method, may satisfy the desired smoothness requirements.
However, it may not accommodate the required transient situations for
high quality synthesis. Indeed, some LPC synthetic speech has a monotone
quality when based on a large window autocorrelation-function pitch
extractor. The SIFT algorithm of Markel attempts to handle these transient
situations by dividing the normally large window into subintervals,
each characterized by a particular excitation function type.¹¹ We
believe this artificial approach would not be necessary with the correct
representation of pitch pulses.

In contrast to the compromises inherent in correlation pitch
extraction, we believe that it is possible to obtain superior performance
(at the price of increased bit rate or complexity, or both) by using
time-domain pitch extraction. Time-domain techniques are capable, in
principle, of yielding smooth pitch and also of marking transient periods
accurately. Time-domain pitch marking is described more completely in
Sections II, A, 3 and II, E of the Task 3 report. Here we summarize the
basic ideas briefly. Time-domain pitch marking is normally done in two
stages: first, locate the largest magnitude peak in a 2- to 10-ms window,
and second, place the pitch mark at some repeatable feature of the
waveform near the large peak. The repeatable feature could be (1) the zero
crossing preceding the peak, (2) the peak itself, or (3) the estimated
point of transition from a decaying to a growing signal. In general,
interactive hand marking of pitch makes use of all these approaches.
Each result is tested to see if it meets the smoothness requirement. If
none does, it is necessary to use a combination of the above. As one
might suspect, the above process is complex, and necessarily so, due to
the wide diversity of the possible speech signals. Consequently, our
experience indicates that the time-domain approach to pitch extraction
is not well suited to implementation as a real-time automatic algorithm.
Nevertheless, it is extremely useful as a laboratory tool and provides
a good reference for the best achievable performance with LPC synthesis.
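
The two-stage marking procedure can be sketched in Python for the simplest
case, in which the repeatable feature is the zero crossing preceding the
peak. The sketch is illustrative only: the function name, the 8-kHz rate,
the 10-ms search window, and the damped-sinusoid test waveform are
assumptions, and the smoothness test and the fallback to other features
described above are not included.

```python
import numpy as np

def mark_pitch(signal, start, window):
    """Return (peak index, pitch-mark index) for one search window.

    Stage 1 locates the largest-magnitude sample in the window.
    Stage 2 walks back to the first sample after the preceding zero crossing.
    """
    segment = signal[start:start + window]
    peak = start + int(np.argmax(np.abs(segment)))
    n = peak
    while n > 0 and signal[n - 1] * signal[peak] > 0.0:
        n -= 1
    return peak, n

fs = 8000                                     # assumed sampling rate (Hz)
period = 80                                   # hypothetical 100-Hz pitch
t = np.arange(400)
phase = t % period
wave = np.exp(-phase / 20.0) * np.sin(2 * np.pi * phase / 16.0)

peak, mark = mark_pitch(wave, 100, fs // 100)  # 10-ms search window
print(peak, mark)
```

On this clean waveform the mark lands one sample after the start of the
pitch period. On real speech the three candidate features can disagree,
which is why each result must still be tested against the smoothness
requirement.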

The complexity of time-domain pitch extraction can be reduced by
performing preprocessing (filtering) of the speech. Three basic types
of filtering may be employed: (1) inverse filtering, (2) low-pass
filtering, and (3) formant-isolation filtering. Each of these is described
in greater detail in the Task 3 report. Here we simply summarize the
results of the research effort.

In  general,  inverse  filtering  is  an  effective  method  of  reproducing 
the  glottal  waveshape  (and  thereby  simplifying  the  time-domain  pitch 
extraction). Analysis over a 20- to 25-ms window on preemphasized speech
is  necessary.  This  approach  encounters  difficulties  when  significant 
phase  distortion  (due  to  the  acoustic  environment,  for  example)  exists 
or  when  the  speech  character  is  rapidly  changing  so  that  the  window  is 
too  large  to  accurately  characterize  the  speech. 

Low-pass  filtering  the  speech,  e.g.,  to  a bandwidth  of  approximately 
600  Hz,  can  significantly  simplify  the  problem  of  pitch  extraction  in  the 
time  domain.  Unfortunately,  our  experience  has  been  that  pitch  marking 
on  this  baseband  signal  is  not  adequate  to  provide  the  desired  high 
quality  synthesis.  Nevertheless,  when  combined  with  other  information, 
the  results  can  be  useful  in  estimating  the  pitch-pulse  marks. 

Formant-isolation filters can be used with significant performance
improvement. However, the complexity of this system is prohibitive for
real-time automatic pitch extraction. Formant isolation when combined
with low-pass filtering can be used as an effective method of hand marking
pitch pulses. It should be noted that at present the process of hand
marking pitch pulses can be greatly shortened by using the
formant-isolation approach. The reader is referred to Section II, E of the
Task 3 report for more details on this subject.

IV TIMING REQUIREMENTS FOR HIGH QUALITY REPRODUCTION


In  this  section  we  consider  the  timing  requirements  for  successful 
speech  reproduction.  An  accompanying  analog  tape  (see  Appendix  C for  a 
detailed  description)  illustrates  the  effects  described  here.  The  output 
is  from  our  LPC  vocoder  simulation  program  residing  in  the  SRI-AI  PDP-10 
computer system. In all cases, the input speech was band-limited to 4 kHz,
sampled at a 10-kHz rate, and preemphasized, i.e., one-point differenced,
in  software.  The  analysis  procedure  used  14  coefficients  and  applied  a 
Hamming  window  for  most  data.  However,  some  analysis  schemes  based  on 
one  pitch  period  used  a rectangular  window.  (All  the  utterances  on  the 
attached  tape  used  overlapped  analysis  with  a Hamming  window  and  pitch- 
synchronous  analysis/synthesis.)  The  synthesizing  filter  was  of  the 
lattice  type  described  by  Itakura.7  The  excitation  was  determined  by 
the  ratio  method  described  in  Section  V. 
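
The preprocessing just described (one-point differencing followed by a
Hamming window) can be sketched as follows. This is a minimal illustration,
not the simulation program itself: the function names are hypothetical, the
difference is assumed to use a unit coefficient with the sample before the
first taken as zero, and the window uses the standard 0.54/0.46 Hamming form.

```python
import math

def preemphasize(x):
    """One-point difference y[n] = x[n] - x[n-1], with x[-1] = 0 (assumption)."""
    return [x[n] - (x[n - 1] if n > 0 else 0.0) for n in range(len(x))]

def hamming(length):
    """Standard Hamming window w[n] = 0.54 - 0.46 cos(2 pi n / (N - 1))."""
    if length < 2:
        return [1.0] * length
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]

def analysis_frame(x, start, length=250):
    """Preemphasized, Hamming-weighted frame.

    250 samples corresponds to 25 ms at the 10-kHz rate quoted in the text.
    """
    seg = preemphasize(x)[start:start + length]
    w = hamming(len(seg))
    return [wi * si for wi, si in zip(w, seg)]
```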

The  following  subsections  study  the  effects  of  (1)  analysis  block 
(window)  length  and  (2)  pitch  accuracy. 

A.  Analysis  Window  Size 

The first set of utterances analyzed in the pitch-synchronous
analysis/pitch-synchronous synthesis (PSA/PSS) mode used a rectangular
data window of one pitch period. A rectangular time window does not
have good skirt selectivity in the frequency domain. Consequently, the
spectral estimates derived from such an LPC analysis are only approximate.
One method of alleviating this problem is to use a Hamming window.
However, without overlapping, significant segments of data may be missed
due to window nulls. These nulls cannot be avoided unless overlapping
and a higher analysis block refresh rate are employed. Of course, this
results in a higher transmission rate unless larger windows are used.
Normally, larger windows are used and an increased response time to
transient effects results.

Experiments  were  performed  both  with  a rectangular  window  of  one 
pitch-period  duration  and  with  an  overlapped  Hamming  window  of  three 
pitch-period  duration,  with  a new  analysis  performed  each  pitch  period. 
These  tests  were  done  in  the  PSA/PSS  mode.  Very  good  quality  resulted 
in both cases. However, the overlapped analysis approach appeared to be
less  sensitive  to  the  precision  of  pitch-pulse  marking.  For  very  low- 
pitched  speakers,  the  overlapped  approach  might  not  yield  a sufficiently 
good  transient  response  to  handle  very  rapidly  changing  speech  segments. 
For  our  data  base,  which  had  a lowest  pitch  of  approximately  100  Hz,  no 
problems  were  encountered.  Consequently,  on  the  basis  of  our  experiments, 
we  would  recommend  an  analysis  window  size  of  20  to  30  ms,  with  25  ms  a 
desired  goal. 

Use of pitch-asynchronous analysis over a fixed window size may
result in slight quality degradation. However, the advantages of a fixed
(rather than a pitch-variable) window size are significant in a
practical sense. As a result, we recommend a window size of 25 ms, with a
new  set  of  coefficients  calculated  every  10  or  15  ms.  The  optimum  value 
must  be  determined  on  the  basis  of  extensive  testing  with  the  adaptive 
data-compression  algorithm  DELCO  (see  Section  VI  of  this  report). 

B.  Pitch  Accuracy  Requirements 

Conflicting estimates of the required pitch accuracy are given in
the literature. Gold and Rabiner indicate that pitch marks must be placed
within 100 µs of true position. Markel places his requirements in the
frequency domain.11 In describing the SIFT algorithm he concludes that
the fundamental frequency estimates must be within 7 Hz of the correct
value. For a nominal pitch of 100 Hz, this corresponds to an accuracy
of 655 µs. Thus, a considerable difference exists between these two
estimates. Consequently, we performed several experiments on our data
base using the LPC synthesizer approach.
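
The conversion between the two accuracy figures is simple arithmetic on the
pitch period T = 1/f0. The sketch below (hypothetical function names) shows
that a 7-Hz increase at 100 Hz shortens the period by roughly 654 µs, close
to the 655-µs figure quoted above, whereas a 100-µs period error at 100 Hz
amounts to only about 1 Hz.

```python
def period_error_us(f0_hz, df_hz):
    """Change in pitch period (microseconds) when f0 rises by df_hz."""
    return (1.0 / f0_hz - 1.0 / (f0_hz + df_hz)) * 1e6

def freq_error_hz(f0_hz, dt_us):
    """Change in fundamental frequency (Hz) when the period shrinks by dt_us."""
    period_s = 1.0 / f0_hz
    return 1.0 / (period_s - dt_us * 1e-6) - f0_hz
```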

Two utterances from our data base were particularly difficult to
reproduce without noticeable roughness. We found, by iteratively hand
placing  pitch  marks,  that  it  was  possible  to  produce  a very  smooth  trace 
for  the  fundamental  frequency.  This  trace,  for  utterance  number  one,  is 
shown  as  the  solid  line  in  the  bottom  trace  of  Figure  1.  This  set  of 
pitch  marks  (called  DTG  in  our  file  notation  system)  was  taken  to  be  the 
true  or  best  estimate  of  the  pitch  function.  A real-time  algorithm  would 
have  great  difficulty  in  generating  a set  of  marks  as  good  because  of  the 
iterative  process  used. 

Two  additional  sets  of  pitch  marks  were  compared  with  the  best  or 
smooth  set  (file  DTG)  to  determine  if  it  is  possible  to  achieve  adequate 
quality with simpler algorithms. The first set (called DTM) was
determined from the unprocessed speech by a simple minimum-phase criterion;
that is, the pitch marks were placed so as to make the speech signal
appear to be a minimum-phase waveform over the pitch period, i.e., a
decaying waveform. This pitch-marking philosophy was adopted since it
seemed best suited to the basic assumptions of the LPC approach. The
fundamental frequency estimates based on this hand-marked, minimum-phase
philosophy are shown in the bottom trace of Figure 1 as the series of
dots scattered about the solid line representing the best hand-marked
set (file DTG). The middle trace shows the frequency difference between
the  two  sets  of  pitch  contours.  The  top  trace  shows  the  envelope  of  the 
sentence. 

The standard deviation for the period differences is 400 µs, and
the standard deviation of the fundamental frequency differences is 5.3 Hz.
The quality of the synthetic speech generated on the basis of the simpler,
minimum-phase,  nonautomatic,  pitch-marking  algorithm  is  substantially 
worse  than  that  generated  on  the  basis  of  the  best  pitch  pulses.  The 
degradation  is  perceived  as  a roughness  in  the  synthetic  speech. 

A second  set  of  pitch  marks  (called  DTO)  was  generated  from  a 
bandpass-filtered version of the input speech using formant isolation
filters (see Task 3 report). The pitch marks were placed at zero
crossings preceding the largest peak in the waveform in an interval
corresponding to the estimated pitch period. An attempt was made to smooth
the  period  estimates  but  not  with  the  same  care  and  effort  as  were  used 
for  file  DTG. 

Figure  2 (same  format  as  Figure  1)  is  a photograph  of  a CRT  display 
comparing  the  DTG  and  DTO  files.  The  standard  deviation  for  the  period 
difference is 200 µs, and the standard deviation of the fundamental
frequency difference is 2 Hz. Perceptually, the two sets of pitch marks
produce  indistinguishable  synthetic  speech. 

As  a result  of  these  and  other  experiments,  we  conclude  that  a pitch 
accuracy of 2 Hz (standard deviation) is adequate for high quality
synthesis. Poorer accuracy will result in a perceptual roughness of the
synthetic  speech.  The  utterances  on  the  tape  compare  the  three  cases 
described  above.  The  reader  (listener)  may  judge  the  significance  of 
the  roughness  effect. 

The  tape  also  includes  two  additional  synthetic  speech  utterances 
that  were  generated  to  determine  whether  the  roughness  was  caused  by 
poor  analysis  windows  or  by  poor  accuracy  excitation.  The  first  synthetic 
speech  utterance  used  rough  (DTM)  pitch  marks  for  analysis  and  smooth 
(DTG)  pitch  marks  for  synthesis.  The  second  utterance  used  smooth  (DTG) 
pitch marks for analysis and rough (DTM) pitch marks for synthesis. The
reader (listener) can readily determine that no quality loss results
from the use of rough analysis pitch marks. However, use of the rough
pitch marks for the excitation function results in synthetic speech with
a rough quality. Thus, we conclude that the roughness results from the
inaccurate excitation pitch marks.

FIGURE 2 OSCILLOSCOPE TRACES OF: (A) TOP TRACE - ENVELOPE OF SPEECH
SIGNAL, (B) MIDDLE TRACE - FUNDAMENTAL FREQUENCY DIFFERENCE
BETWEEN PITCH MARKS IN FILES DTG AND DTO, AND (C) BOTTOM
TRACE - SOLID LINE, PITCH CONTOUR FOR THE BEST SET OF HAND-
MARKED PITCH PULSES (FILE DTG) AND DOTTED LINE, PITCH CONTOUR
FOR PITCH MARKS DERIVED FROM A SMOOTHED ESTIMATE OF PITCH
BASED ON THE OUTPUT OF A LOW-PASS FILTER (FILE DTO)


V LPC SYNTHESIZER EXCITATION

Conventional  channel  vocoders  use  either  buzz  (pitch  pulses)  or 
hiss  (random  noise)  excitation  depending  on  whether  voiced  or  unvoiced 
synthesis  is  being  performed.  This  concept  has  been  extended  to  the 
original  LPC  analysis/synthesis  systems  as  well,  with  reasonably  good 
results.


Part  of  our  research  effort  was  devoted  to  considering  improvements 
in  the  excitation  function.  The  most  obvious  modification  is  to  use  a 
mixture  of  noise  and  pulses  for  the  excitation.  From  a decision-theory 
point of view, this mixture has the obvious advantage of avoiding
catastrophic failures when a voiced/unvoiced (V/UV) error is made. Instead,
the "soft" character of the processing (estimation as opposed to decision)
should provide graceful degradation.

Another  major  advantage  is  that  speech  does  not  consist  of  solely 
voiced  or  solely  unvoiced  segments.  Perhaps  the  best  known  example  of 
a different  segment  is  the  voiced  fricative.  Here  the  excitation  is  a 
composite  of  noise  (due  to  turbulence  produced  by  a constriction)  and 
pulses  (due  to  the  action  of  the  vocal  cords).  Other  lesser  known  cases 
exist. For example, Fujimura found that many voiced sounds contain
unvoiced power in certain portions of the frequency spectrum.13 Thus, a
mixture of noise and pulse excitation appears to provide a better
approximation to the true excitation source.

We have developed an excitation function that is just such a
mixture of random noise and pulses. The ratio of noise to pulse power is
controlled by the normalized error or residual energy, ERRN. The
reasoning is that voiced processes are more predictable than unvoiced
processes. Consequently, the normalized error energy for voiced signals
should be much less. (The normalization is required since voiced signals
generally have much higher power than unvoiced signals.) Atal and Hanauer
confirm that this is a valid approach.

Many relationships between the ratio of the noise and pulse powers
(RATIO) and ERRN have been tried. Through our experiments, we have
found that the following characteristic (shown in Figure 3) provides the
optimum performance. The ratio of the noise energy, $E_n$, to the sum of
the pitch pulse plus noise energies, $E_p + E_n$, is defined as the variable

$$\mathrm{RATIO} = \frac{E_n}{E_n + E_p}$$

Below a value of ERRN = 0.250,

$$\mathrm{RATIO} = 16\,(\mathrm{ERRN})^2$$

For ERRN $\geq$ 0.250,

$$\mathrm{RATIO} = 1.0$$


That is, if the normalized error energy exceeds 0.250, only hiss
excitation is used. For smaller values of ERRN, the excitation rapidly
converges to consist primarily of pulse energy.
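
The characteristic above is a quadratic that saturates at unity. A minimal
sketch (the function name `noise_ratio` is hypothetical) makes the
continuity at ERRN = 0.250 explicit, since 16 × 0.250² = 1.0:

```python
def noise_ratio(errn):
    """RATIO = E_n / (E_n + E_p) as a function of normalized error ERRN.

    Quadratic below ERRN = 0.250, pure hiss (RATIO = 1.0) at or above it.
    """
    if errn >= 0.25:
        return 1.0
    return 16.0 * errn * errn
```

For example, ERRN = 0.125 gives RATIO = 0.25, so three quarters of the
excitation energy is carried by the pitch pulses.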

The excitation requires the information given by RATIO plus the
residual energy, $E = \mathrm{ERRN} \cdot R_0$, where $R_0$ is the input signal
power over the analysis window. With this information the proper absolute
energy can be applied to each source. Note that our excitation power formula
is based on ERRN where

FIGURE 3 RATIO OF NOISE ENERGY TO SUM OF NOISE AND PULSE
ENERGIES AS A FUNCTION OF ERRN

This value corresponds to the true normalized residual energy for a
non-Toeplitz-form LPC analysis. However, for the Toeplitz-form analysis
that  we  conventionally  use,  ERRN  is  only  an  approximation  to  the  correct 
value.  Fortunately,  for  our  size  analysis  window  and  for  the  number  of 
LPC  coefficients  (14  or  fewer),  the  approximation  is  quite  good  and  high 
quality  synthesis  results. 


A more  serious  approximation  exists.  The  above  excitation  philosophy 
is  based  on  the  assumption  that,  if  we  match  the  excitation  power,  the 
output  power  will  match  the  power  of  the  input  speech.  Unfortunately, 
this result does not hold perfectly because of coherent, transient
cancellation effects when the synthesizing filter coefficients are updated.
That is, the decay response of the initial conditions left over from the
previous analysis period can coherently add to or subtract from the present
interval. Fortunately, the magnitude of the initial condition response
is normally quite small compared with the impulse response. Nevertheless,
the result is that the envelope function of the synthetic speech
is  considerably  more  jagged  than  the  input  speech.  In  fact,  the  dynamic 
range  of  the  synthetic  speech  may  be  four  times  greater  than  the  input 
speech. 

The  above  effect  leads  to  a certain  harshness  (perceptible  under 
ideal listening conditions) in the synthetic speech. However, the
procedure for resolving this minor problem is computationally complex; Atal
and Hanauer describe this method of guaranteeing a power match between
the input and synthetic speech processes. Our conclusion is that the
quality  improvement  is  not  worth  the  additional  system  complexity. 

Our  experiments  with  the  excitation  mixture  concept  indicate  that 
very  high  quality  synthesis  can  be  achieved.  In  fact,  the  resulting 
synthetic  speech  is  virtually  indistinguishable  from  the  input  speech. 
Furthermore, the excitation mixture system appears more robust with
respect to other system degradations. For example, some evidence exists
that  the  presence  of  noise  in  the  excitation  signal  tends  to  mask  the 
roughness  associated  with  pitch-asynchronous  synthesis.  As  a result, 
we  recommend  the  use  of  the  excitation  mixture  concept. 


VI  ASYNCHRONOUS  TRANSMISSION  OF  LPC  PARAMETERS 


A.  Introduction 

Speech  is  an  inherently  asynchronous  time-varying  process.  The 
properties of the signal vary with the short-term properties of the
particular utterance. It is well known that for various reasons speech
contains pauses ranging in duration from a few milliseconds to several
seconds. Similarly, we find that quasi-stationary portions of voiced speech
over several excitation periods, e.g., over approximately 80 ms, are not
uncommon. In contrast, we also find significant signal character changes
occurring in one or two excitation periods. A characteristic of an
adaptive speech compression system designed for asynchronous operation is a
nonuniform data transmission rate commensurate with the varying
properties of the input signal. An advantage over similar synchronous systems
is  the  retention  of  a given  quality  of  synthetic  speech  at  a lower  average 
bit  transmission  rate.  An  asynchronous  system  interfaces  nicely  with 
asynchronous-transmission  circuits,  such  as  those  employing  packet-switching 
techniques.  The  interface  to  an  ordinary  synchronous  circuit  requires 
data  buffering  to  achieve  the  uniform  transmission  rate. 

In this section we seek a measure, $\delta$, of the change of signal
properties in speech from one analysis frame to another. The transmission
strategy is then to transmit new LPC parameters to the synthesizer only
when $\delta$ (the change between the previously transmitted frame and the
current frame) exceeds a predetermined threshold. Four candidate measures
($\delta_1$, $\delta_2$, $\delta_3$, and $\delta_4$) are defined, discussed, and
evaluated. Experimental results are presented showing that the adaptive LPC
transmission algorithm based on $\delta_4$ yields a 50 percent to 70 percent
reduction in bit rate with negligible loss in speech quality. These results
are found for many speakers, utterances, and types of LPC analysis.
Statistical results describing the time between coefficient updates and the
time between transmission of successive packets in a typical packet
communication system are presented and discussed.

Appendix  A describes  how  a particular  adaptive  compression  algorithm 
(DELCO)  that  was  developed  in  this  research  effort  can  be  interfaced  to 
a packet  communication  system. 

B.  The  LPC  Model 

In  most  formulations  LPC  coefficients  are  used  to  model  the  combined 
effects  of  the  glottal  source,  the  vocal  tract  shape,  and  radiation 
characteristics.  At  a particular  instant  in  time  a speech  sample,  s(n), 
is  approximated  by  a linearly  weighted  summation  of  the  past  p samples. 
That  is, 

$$\hat{s}(n) = \sum_{i=1}^{p} a(i)\, s(n-i)$$

The prediction error (or residual) is given by

$$e(n) = s(n) - \hat{s}(n)$$

and  the  linear  predictive  coefficients  are  found  by  minimizing  the 
squared  error  summed  over  a given  duration.  The  result  is  a set  of  p 
linear  equations  in  terms  of  the  autocorrelation  coefficients.  Depending 
on  the  precise  formulation,  the  matrix  of  autocorrelation  coefficients 
may  be  Toeplitz  or  non-Toeplitz  in  form.  The  impact  of  this  difference 
is not great. (There are some complexity reductions for the
Toeplitz-form case.) In either case the residual energy is given by

Both Toeplitz and non-Toeplitz analysis assume that the speech
process is stationary over short intervals (approximately 10 to 20 ms).
Thus, the LPC model assumes a piece-wise stationary process. In
addition, the LPC model assumes that the speech process can be adequately
modeled  by  an  all-pole  (or  autoregressive)  source.  To  date  there  is  no 
indication  that  the  quality  of  the  reconstructed  speech  is  deteriorated 
by  either  of  these  assumptions. 
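
For the Toeplitz (autocorrelation) form, the p linear equations mentioned
above are commonly solved with the Levinson-Durbin recursion. The sketch
below is an illustration under that assumption; this report does not state
which solver was used, and the function names are hypothetical. It returns
the predictor coefficients a(i), the reflection coefficients k(i) (whose
sign convention varies between authors), and the residual energy.

```python
def autocorr(x, p):
    """Autocorrelation lags r(0)..r(p) of a (windowed) frame x."""
    return [sum(x[n] * x[n + k] for n in range(len(x) - k))
            for k in range(p + 1)]

def levinson(r, p):
    """Solve the Toeplitz normal equations by Levinson-Durbin recursion.

    Returns (a, k, err): a[i-1] = a(i) so that s_hat(n) = sum a(i) s(n-i),
    k[i-1] = reflection coefficient k(i), err = residual energy.
    Assumes r[0] > 0 and |k(i)| < 1 at every step.
    """
    a = [0.0] * (p + 1)
    k = [0.0] * (p + 1)
    err = r[0]
    for i in range(1, p + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k[i] = acc / err
        new_a = a[:]
        new_a[i] = k[i]
        for j in range(1, i):
            new_a[j] = a[j] - k[i] * a[i - j]
        a = new_a
        err *= (1.0 - k[i] * k[i])  # energy shrinks at each order
    return a[1:], k[1:], err
```

Dividing the returned residual energy by r(0) gives a normalized residual
energy of the kind used for ERRN in Section V.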


Atal prefers to view the LPC analysis from the time-domain
viewpoint.17 However, one can regard the LPC approach from the frequency
domain equally well. In fact, Makhoul has shown that the Toeplitz form
of  LPC  analysis  matches  the  peaks  of  the  envelope  of  the  short-term 
power  spectrum.3 

For  the  purpose  of  evaluating  the  performance  of  algorithms  for 
the  adaptive  transmission  of  LPC  coefficients,  we  extensively  use 
frequency-domain techniques, such as the short-term power spectra
derived from LPCs [see Figures 5 through 10 and the graphs of frequency
(formant)  peaks,  Figures  11  through  18].  Listening  tests  verify  that 
preserving  spectral  properties  gives  good  quality  reproduction. 


C. Description of Adaptive Measures


The problem is to determine a measure of the amount of change in
vocal tract parameters from one analysis frame to another and then to
use this information as a means of adaptively transmitting the LPC
coefficients at a reduced transmission rate. Four measures are examined.

For each measure a function, $\delta$, is defined whose value is used to
indicate the relative amount of change in coefficients between two analysis
frames. A low value of $\delta$ should indicate similar vocal tract parameters
over the two frames. A high value of $\delta$ should indicate that the vocal
tract parameters for the two frames are substantially different. The first
three measures ($\delta_1$, $\delta_2$, and $\delta_3$) are computed directly from
the LPC coefficients. These functions reflect various assumptions about the
relationship between changes in vocal tract parameters and the changes in
LPC coefficients. The fourth measure, $\delta_4$, considers the normalized
residual energy over the nth analysis epoch using nonoptimum versus optimum
coefficients. Although somewhat more computationally complex than $\delta_1$,
$\delta_2$, and $\delta_3$, $\delta_4$ is based on the normalized residual energy
and is consistent with the theoretical analysis of Magill.15,16

1.  Adaptive  Measures  Based  on  the  LPC  Parameters 
or  Transformed  Versions  of  Them 

Although the measures about to be described may operate on the
coefficients a(i), the same measures may operate on the reflection
coefficients or partial correlation (PARCOR) coefficients, k(i), of Itakura
and Saito. For reasons that will become evident as the discussion
proceeds, we use the reflection coefficients, k(i). Note that in the
following definitions, $\delta_1$, $\delta_2$, and $\delta_3$ are not necessarily
based on all of the k(i). Only the first few may be used.

Consider a q-dimensional subset of the coefficients k(i) as a
discretely varying q-tuple of real numbers, K, on the inner product
space, H, of dimension q. The canonical inner product is defined to
be $\langle U, V \rangle = u(1)v(1) + u(2)v(2) + \cdots + u(q)v(q)$. The length
of U is defined as $|U| = \sqrt{\langle U, U \rangle}$. Let the superscripts m
and n respectively
denote  the  mth  and  nth  analysis  frames  with  m < n. 

Measure 1 is simply the distance between the q-tuples $K^n$ and
$K^m$, i.e., the length of $K^n - K^m$. For Measure 1,

$$\delta_1 = |K^n - K^m| = \left[ \sum_{i=1}^{q} \left[ k^n(i) - k^m(i) \right]^2 \right]^{1/2}$$


Measure 2 is the length of $K^n - K^m$ normalized (divided) by the
length of $K^n$,

$$\delta_2 = \frac{|K^n - K^m|}{|K^n|} = \left[ \frac{\sum_{i=1}^{q} \left[ k^n(i) - k^m(i) \right]^2}{\sum_{i=1}^{q} \left[ k^n(i) \right]^2} \right]^{1/2}$$


Measure 3 is the length of $K^n - K^m$ where each component of
$K^n - K^m$ is scaled by a factor inversely proportional to the magnitude of
each component of $K^n$,

$$\delta_3 = \left[ \sum_{i=1}^{q} \left[ \frac{k^n(i) - k^m(i)}{k^n(i)} \right]^2 \right]^{1/2}$$
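
The three coefficient-based measures can be sketched directly from their
verbal definitions. The function names below are hypothetical, and the form
used for Measure 3 (each component of the difference divided by the
corresponding component of the current frame before taking the length) is a
reading of the prose description rather than a verified transcription of the
printed formula. Each function expects the first q reflection coefficients of
the current frame (`kn`) and of the last transmitted frame (`km`).

```python
import math

def delta1(kn, km):
    """Measure 1: Euclidean distance |K^n - K^m|."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(kn, km)))

def delta2(kn, km):
    """Measure 2: |K^n - K^m| divided by the length |K^n|."""
    return delta1(kn, km) / math.sqrt(sum(a * a for a in kn))

def delta3(kn, km):
    """Measure 3: length of the difference with each component scaled by
    1 / k^n(i). Undefined if any k^n(i) is zero."""
    return math.sqrt(sum(((a - b) / a) ** 2 for a, b in zip(kn, km)))
```

Measures 2 and 3 become ill-conditioned when the current coefficients are
near zero, which is one practical difference from Measure 1.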


2.  Theoretically  Optimum  Adaptive  Measure 

Let the superscripts m and n respectively denote the mth and
nth analysis frames with m < n, as in the previous section.
Measure 4 derives the function $\delta_4$ from a comparison of the
normalized error energy over the nth analysis frame with the normalized
error energy over the same analysis frame using the coefficients from
the mth frame, denoted by the vector $A^m$:*

$$\delta_4 = 1 - \frac{E^n(A^n)}{E^n(A^m)}$$

$E^n(A^n)$ is the error energy using optimum coefficients for the nth frame,

$$E^n(A^n) = R^n(0,0) - \sum_{i=1}^{p} a^n(i)\, R^n(0,i)$$

and $E^n(A^m)$ is the error energy using nonoptimum coefficients for the nth
frame,

$$E^n(A^m) = R^n(0,0) - 2 \sum_{i=1}^{p} a^m(i)\, R^n(0,i) + \sum_{i=1}^{p} \sum_{l=1}^{p} a^m(i)\, a^m(l)\, R^n(i,l)$$
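
A minimal sketch of Measure 4 follows. The function names and the nested-list
representation of the correlation matrix R(i,l) are assumptions made for
illustration. The general quadratic form is used for both energies; for the
optimum coefficients it reduces to the shorter expression for $E^n(A^n)$,
because the normal equations make the double sum equal to the single sum.
The helper `toeplitz` builds R(i,l) = r(|i - l|) for the Toeplitz case.

```python
def toeplitz(r):
    """Correlation matrix R(i,l) = r(|i-l|) from lags r(0)..r(p)."""
    size = len(r)
    return [[r[abs(i - l)] for l in range(size)] for i in range(size)]

def error_energy(a, R):
    """E(A) = R(0,0) - 2 sum a(i) R(0,i) + sum sum a(i) a(l) R(i,l).

    a[i-1] holds a(i), i = 1..p; R is the (p+1) x (p+1) matrix of the
    current frame.
    """
    p = len(a)
    e = R[0][0]
    e -= 2.0 * sum(a[i - 1] * R[0][i] for i in range(1, p + 1))
    e += sum(a[i - 1] * a[l - 1] * R[i][l]
             for i in range(1, p + 1) for l in range(1, p + 1))
    return e

def delta4(a_n, a_m, R_n):
    """Measure 4: 1 - E^n(A^n) / E^n(A^m), between 0 and 1 when a_n is optimum."""
    return 1.0 - error_energy(a_n, R_n) / error_energy(a_m, R_n)
```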

D.  Transmission  Strategy 

For each measure, we hypothesize that a low value of $\delta$ will indicate
similar vocal tract parameters over both the mth and the nth frames and
that a higher value of $\delta$ will indicate different vocal tract parameters
for the nth frame, compared with the mth frame. The transmission strategy
is to send coefficients when $\delta$ exceeds a given threshold, $\gamma$, where
the mth frame corresponds to the last transmitted coefficients and the nth
frame is the current frame.

* Note that Measure 4 is simply a transformed version of the measure
used in Appendix A. Here $\delta_4$ is constrained to lie between 0 and 1,
whereas in Appendix A the equivalent parameter DEL lies between 1 and $\infty$.
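
The transmission strategy of this subsection can be sketched as a simple
loop. The function name `select_frames`, the convention that the first frame
is always sent, and the generic `measure` callback are assumptions for
illustration; they are not taken from the DELCO implementation described in
Appendix A.

```python
def select_frames(frames, measure, gamma):
    """Return the indices of frames whose parameters are transmitted.

    frames  : list of per-frame parameter sets (e.g., coefficient lists)
    measure : function (current, last_sent) -> delta
    gamma   : threshold; a frame is sent only when delta exceeds it
    """
    if not frames:
        return []
    sent = [0]            # assumption: the first frame is always sent
    last = frames[0]
    for n in range(1, len(frames)):
        if measure(frames[n], last) > gamma:
            sent.append(n)
            last = frames[n]   # the reference is the last transmitted frame
    return sent
```

Note that each frame is compared with the last transmitted frame rather than
with its immediate predecessor, so slow drift eventually triggers an update.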


A typical synthetic speech waveform resulting from the above
transmission strategy is shown in Figure 4. This figure presents three
separate waveforms. The top trace is the envelope function associated with
the utterance, "Add the sum to the product of these three." The two
marks corresponding to the speech segment (duct) represent the interval
that is shown in greater detail in the lower traces. The middle trace
is the synthetic speech for this interval; the lower trace is the input
speech during this interval. Note how the peaks of the input speech
vary smoothly with time. By contrast, the synthetic speech peaks tend
to follow a step-function-like contour. This is the case since the
excitation power level is updated only when the LPC parameters are updated.
Thus, by observing the middle trace one can see that new LPC parameters
were transmitted at approximately 1.522, 1.535, 1.556, and 1.582 s. The
step-like character of the envelope of the synthetic speech appears to
the eye to be a significant distortion. Fortunately, it is virtually
imperceptible to the human ear. Consequently, no attempt has been made
to update the excitation power levels more frequently.


E. Empirical Evaluation of Coefficient Measures


A typical  test  case  is  shown  in  Figures  5 through  10.  In  all  cases 
the  lower  trace,  which  shows  the  input  speech,  is  the  same.  The  upper 
trace,  which  differs  from  figure  to  figure,  shows  the  power  spectrum 
computed  from  different  speech  segments.  The  speech  sample  (lower  trace) 
is  the  syllable  "Pete,"  minus  the  initial  stop  release,  taken  from  the 
utterance,  "Pete  Cooper's  dog  toyed  with  Dick  Todd’s  cat."  The  LPC 
power  spectra  (upper  trace)  were  computed  by  taking  the  reciprocal  of  the 
log  magnitude  spectrum  of  the  inverse  filter.  The  basic  technique  is 
described by Markel.6 For the test cases, LPC coefficients were computed
using analysis systems denoted as either PTOVR (pitch-synchronous analysis
using overlapped analysis frames, p = 14, with a Hamming window applied)
or PTSYN (pitch-synchronous analysis, no overlap of analysis frames,
p = 14, no Hamming window applied). The spectra were computed for 128
spectral points over the indicated marks with the vertical scale in
decibels and the horizontal scale going to 5 kHz.

FIGURE 10 LPC SPECTRA OVER THE SYLLABLE "PETE"
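
An LPC power spectrum of the kind shown in these figures can be sketched by
evaluating the inverse filter $A(z) = 1 - \sum a(i) z^{-i}$ on the unit circle
and plotting the reciprocal of its magnitude in decibels. The function name,
the unit gain default, and the direct evaluation (rather than an FFT) are
assumptions for illustration; the 128 points and the 10-kHz sampling rate
follow the text.

```python
import cmath
import math

def lpc_spectrum_db(a, gain=1.0, n_points=128, fs=10000.0):
    """Return (freqs_hz, level_db) for |gain / A(e^{jw})| from 0 to fs/2.

    a[i-1] holds the predictor coefficient a(i). Fails if A is exactly
    zero at a sampled frequency (a pole on the unit circle).
    """
    freqs, levels = [], []
    for m in range(n_points):
        w = math.pi * m / n_points            # radians per sample
        inv = 1.0 - sum(ai * cmath.exp(-1j * w * (i + 1))
                        for i, ai in enumerate(a))
        freqs.append(w * fs / (2.0 * math.pi))
        levels.append(20.0 * math.log10(gain / abs(inv)))
    return freqs, levels
```

With fs = 10 kHz the 128 points are spaced about 39 Hz apart and stop just
short of 5 kHz.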

Note  that  the  spectra  of  intervals  1-2  versus  2-3  (Figure  5)  show 
transitions both in frequency (about 5 kHz/s for second and third
formants) and in amplitude. For intervals 2-3 and 3-4 (Figure 6) the same
is true but to a lesser extent. Commencing with interval 3-4, we
experience a relatively slowly changing spectrum. Figures 7 through 10 show
that a significant portion of the speech sample, say 3-9 or perhaps 3-10,
may be approximated as being quasi-stationary. Aided by this information,
we  may  compare  results  obtained  using  the  various  coefficient  measures. 

Tables  1 and  2 briefly  summarize  results  obtained  for  the  syllable 
Pete  using  PTSYN  and  PTOVR  analyses  for  various  threshold  values.  In 
each  case  the  nth  frame  is  the  current  frame  and  the  mth  frame  is  the 
frame  at  which  time  the  coefficients  were  last  transmitted.  If  the 
threshold  is  exceeded,  we  transmit  coefficients;  otherwise  the  previous 
coefficients are used. For example, the $\delta_4$ measure with PTSYN analysis,
$\gamma$ = 0.4, transmitted a new set of coefficients at 187.9 ms (Index 1)
and at 195.0 ms (Index 2). The set at 195.0 ms was used over the next
four periods and then subsequently refreshed by a new set at 230.3 ms.

The table dramatically points out the poor performance of some of the
measures, e.g., Measure 2, PTSYN, q = 4, $\gamma$ = 0.4 and Measure 1, PTSYN,
q = 4, $\gamma$ = 0.3. These measures transmit coefficients during the
quasi-stationary portion of the signal when it is unnecessary.

Another performance evaluation is obtained by comparing the function
δ* given γ and the resultant transmission decisions with the graphs of
frequency (formant) peaks. For visual representation of transmission
decisions, the function δ* is defined:

$$
\delta^* =
\begin{cases}
\delta, & \delta \ge \gamma \quad \text{(transmit coefficients)} \\
\delta - 1.0, & \delta < \gamma \quad \text{(do not transmit coefficients; biased downward)}
\end{cases}
$$
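The biasing rule above can be sketched in a few lines of Python. This is
only an illustration, not the program used for the report; the function
name `biased_delta` and the sample measure values are hypothetical.

```python
def biased_delta(delta, gamma):
    """Return (delta_star, transmit) for one analysis frame.

    delta >= gamma: new coefficients are transmitted and delta is plotted as is.
    delta <  gamma: the previous coefficients are kept and delta is plotted
                    shifted down by 1.0 so the two cases separate visually.
    """
    transmit = delta >= gamma
    delta_star = delta if transmit else delta - 1.0
    return delta_star, transmit


# Hypothetical measure values for five consecutive frames, gamma = 0.3.
frames = [0.45, 0.10, 0.30, 0.05, 0.62]
decisions = [biased_delta(d, 0.3) for d in frames]
transmitted = sum(1 for _, sent in decisions if sent)
```

Plotting the first element of each pair against time gives a trace in which
transmitted frames sit at or above γ and retained frames sit below zero
(provided δ stays under 1.0).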

Figures 11 through 18 show the decision process for two different
utterances by different speakers using various analysis techniques. The
uppermost solid trace is the rms power envelope of the speech signal.
Immediately below it are three formant traces, frequency (each division
corresponds to 250 Hz) versus time, which mark the location of the frequency
peaks in the LPC power spectra computed at 10-ms intervals over an analysis
window of 15 ms (overlapped analysis). The lower trace is a plot of δ*
versus time as computed for the particular run. When the formant traces
remain constant, we expect to see a small number of occurrences where
the decision is to change coefficients (δ* biased downward). When the
formant traces are changing, we expect to see a large number of
occurrences where the decision is to change coefficients (δ* remains unbiased).


F. Results

Of the four coefficient measures, best results are obtained using
Measure 4 (the measure based on the residual signal energy). Extensive
listening tests verify that high performance is maintained for male and
female speakers over several utterances using both pitch-synchronous and
pitch-asynchronous analysis. Using overlapping analysis frames introduces
redundant data in the overlap period and produces a smoother δ₄ function.
This allows for more accurate extraction of the changes in the vocal
tract parameters than is obtained using nonoverlapped analysis frames.
However, with the transmission strategy previously described, the
discernible difference in performance for overlapped versus nonoverlapped
analysis is small. Generally, better speech quality is obtained for
pitch-synchronous than for pitch-asynchronous analysis. However, this
same result was observed without adaptive compression. Comparison of
δ₄* functions for pitch-synchronous analysis is not important for adaptive
compression.*

FIGURE 12 TRANSMISSION DECISION FUNCTION δ* FOR MEASURE 4, γ = 0.35
Analysis type: PTOVR (pitch synchronous overlapped)

FIGURE 13 TRANSMISSION DECISION FUNCTION δ* FOR MEASURE 4, γ = 0.25
Analysis type: PTOVR without pre-emphasis

FIGURE 14 TRANSMISSION DECISION FUNCTION δ* FOR MEASURE 4, γ = 0.3
Analysis type: Block synchronous using overlapped 25-ms analysis frames
shifted at 15 ms

FIGURE 16 TRANSMISSION DECISION FUNCTION δ* FOR MEASURE 3, γ = 0.5
Analysis type: PTOVR (pitch synchronous overlapped)

FIGURE 17 TRANSMISSION DECISION FUNCTION δ* FOR MEASURE 3, γ = 0.5
Analysis type: PTOVR (pitch synchronous overlapped)

FIGURE 18 TRANSMISSION DECISION FUNCTION δ* FOR MEASURE 4, γ = 0.3
Analysis type: PTOVR (pitch synchronous overlapped)

Measure 3 [based directly on the coefficients k(i)] produces a
slightly higher average bit rate for a given quality of synthetic speech
than that obtained using Measure 4. This measure (and Measures 1 and 2)
performs best when applied over only the first few k(i), using
overlapped analysis frames. The results indicating better performance with
Measures 1, 2, and 3 applied over only a few of the k(i), e.g., q = 4, were
not expected. However, letting q = 4 eliminates the "noise" introduced
into the computation of δ by the higher-order terms, which are not as
accurate as the lower-order terms. By contrast, Measure 4 does not
require overlapped analysis for acceptable performance. For this reason,
Measures 1 and 2 are dropped entirely. Although Measure 4 is more robust
and theoretically more justifiable than Measure 3, the latter demonstrates
that tolerable transmission decisions may be extracted directly from the
k(i).

* We hypothesize that the pitch-asynchronous degradation is associated with
imperfect gain settings in the excitation function and not with the LPC
parameters themselves.

Typical rates of compression obtained over and above the data
transmission rates obtained from synchronous LPC systems are summarized in
Tables 3, 4, and 5. The baud rates are computed on the basis of 72 bits
per transmitted frame with 14 coefficients quantized at 6, 6, 4, 4, 4 ...
bits respectively for a total of 60 bits. Twelve additional bits are
provided for excitation amplitude, pulse/noise ratio, and pitch
information. Since the pitch frequency for these tests is derived from
hand-marked pitch pulses, it is unquantized; therefore, the given baud
rates are estimated baud rates. The results show that adaptive
transmission of LPC parameters allows an impressive reduction in average bit
transmission rate.
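The frame budget quoted above can be checked with a short calculation. The
sketch below assumes that the elided allocation ("6, 6, 4, 4, 4 ...") gives
4 bits to every coefficient after the first two, which is the only uniform
choice that reproduces the stated 60-bit total. The function names and the
example frame counts are hypothetical.

```python
# Assumed allocation: 6, 6, then 4 bits for each of the remaining 12
# coefficients (inferred from the stated 60-bit total for 14 coefficients).
COEFF_BITS = [6, 6] + [4] * 12
SIDE_BITS = 12  # excitation amplitude, pulse/noise ratio, pitch information


def frame_bits(coeff_bits=COEFF_BITS, side_bits=SIDE_BITS):
    """Bits carried by one transmitted frame."""
    return sum(coeff_bits) + side_bits


def average_bit_rate(bits_per_frame, frames_sent, duration_s):
    """Average rate in bits/s when only frames_sent frames are transmitted."""
    return bits_per_frame * frames_sent / duration_s


bits = frame_bits()
# Hypothetical example: 100 analysis frames in one second, half of them sent.
full_rate = average_bit_rate(bits, 100, 1.0)
adaptive_rate = average_bit_rate(bits, 50, 1.0)
```

With these illustrative counts the rate falls from 7200 to 3600 bits/s; the
measured reductions are those reported in Tables 3, 4, and 5.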

All four of the transmission measures described above attempt to
respond only to spectral changes* and, of course, are derived from values
that are normalized with respect to signal power. Since, in general, the
gross spectral properties cannot be expected to remain constant during
pauses in the speech, some unnecessary transmission of LPC parameters
may occur during pauses. This problem is clearly evident in Figures 17
and 18. The algorithm should therefore be augmented with a signal
present/absent detector.

G. Transmission Statistics

The time between coefficient updates using the asynchronous strategy
(pitch-synchronous analysis) varies from one to several pitch periods.
The minimum, maximum, average, and standard deviation of the time between
coefficient updates for several speech utterances by a variety of speakers
are given in Table 6. The table shows that, for a typical speaker with
γ = 0.3, an average time between coefficient updates of approximately
30 ms can be expected, with a standard deviation of about 20 ms, although
minimum and maximum times between coefficient updates can be expected to
range from 3 to 200 ms.†

* Theoretically, only Measure 4 can be clearly tied to spectral changes.
In general, measures based on the reflection coefficients or the LPC
parameters are not reliable since the transformation between them and
the power spectrum is not metric preserving.

† The minimum value of 3 ms results for pitch-synchronous analysis with
a female speaker of approximately 300 Hz pitch. More realistically for
PAA, the minimum value is 10 ms.
If we assume that no special buffering or smoothing of the data takes
place, the effect of the asynchronous data rate on a packet transmission
system will be to produce a corresponding asynchronous packet transmission
rate. Table 7, which uses the same speech utterances and speakers as
Table 6, presents statistical data with respect to the time between packet
transmissions. For 360 data bits/packet and a typical speaker with γ = 0.3,
an average time between packet transmissions of approximately 160 ms can
be expected. The standard deviation is about 40 ms. Minimum and maximum
times between packet transmissions can be expected to range from 40 to
slightly over 250 ms. Similar conclusions may be derived from the T
data bits/packet statistics. It is worth noting that a practical packet
transmission system will require the maximum time between packet
transmissions to be limited.
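A rough consistency check ties these packet figures to the coefficient
update statistics. The sketch below assumes that a 360-bit packet carries
whole 72-bit frames (the frame size given in Section F) and that successive
update intervals are statistically independent. Neither assumption is stated
in the report, and the function name is hypothetical.

```python
import math

BITS_PER_FRAME = 72  # frame size from Section F


def packet_interval_stats(bits_per_packet, mean_update_ms, std_update_ms):
    """Estimate mean and standard deviation of the time between packets.

    Assumes a packet is sent once it holds a whole number of frames and that
    the update intervals are independent, so variances add.
    """
    frames = bits_per_packet // BITS_PER_FRAME
    mean_ms = frames * mean_update_ms
    std_ms = math.sqrt(frames) * std_update_ms
    return frames, mean_ms, std_ms


# Typical-speaker values quoted for gamma = 0.3: 30 ms mean, 20 ms deviation.
frames, mean_ms, std_ms = packet_interval_stats(360, 30.0, 20.0)
```

Under these assumptions a packet holds five frames, giving a mean of 150 ms
and a standard deviation of about 45 ms, close to the approximately 160 ms
and 40 ms read from Table 7.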


VII  CONCLUSIONS 


Based on our simulation results, reconstructed speech quality appears
not to depend on whether the LPC analysis is of the Toeplitz or the
non-Toeplitz type. Other factors, such as pitch extraction, have a much
greater bearing on the speech quality. The advantage of the Toeplitz
analysis is that the computed reflection coefficients are guaranteed to
produce a stable synthesizing filter. Consequently, our major research
effort concentrated on Toeplitz-form LPC analysis/synthesis systems.

Our research demonstrated that the best quality synthetic speech
resulted when pitch-synchronous analysis and synthesis were performed.
The degradation with pitch-asynchronous synthesis was much greater than
that associated with pitch-asynchronous analysis. Of course, significant
pitch-pulse location errors in the synthesizer excitation function are
far more noticeable than either of the above degradations. A major
difficulty with pitch-synchronous analysis is that the analysis window
varies in size with the speaker's pitch.

Since better performance was achieved with pitch-synchronous
analysis, we investigated time-domain (i.e., absolute pitch-pulse placement)
pitch extraction. Constructing a good, reliable time-domain pitch
extractor is very difficult. The reader is referred
to the Task 3 report for further details. Here, it suffices to say that
we developed an algorithm that greatly simplified the job of hand placing
pitch marks. A human operator (needed to correct occasional pitch errors)
using this algorithm can generate a set of absolute-time pitch-pulse
marks that, when used with pitch-synchronous LPC analysis and synthesis,
produces synthetic speech virtually indistinguishable from the input
speech. These absolute pitch marks serve as a useful reference set for
comparison with the outputs of more practical pitch extractors. A
computer program has been developed that computes the standard deviation
between two sets of pitch marks, making it convenient to compare any
absolute-time pitch extractor with the best possible pitch marks.

Based on our simulations with inferior pitch extractors, we
determined that the required accuracy (on a pitch of 100 Hz) is approximately
2 Hz rms. That is, a set of pitch marks with a standard deviation of
2 Hz, with respect to the best set of hand-marked pitch pulses, produced
acceptable quality synthetic speech. However, when the standard deviation
was increased to 4 Hz, a definite roughness was perceptible in the
synthetic speech. The required pitch accuracy scales with frequency, so that
1-Hz and 4-Hz standard deviations are acceptable at pitches of 50 and
200 Hz, respectively.
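The figures above amount to a tolerance of about 2 percent of the pitch
frequency. The sketch below encodes that scaling and shows one plausible way
to express the deviation between two sets of pitch marks in hertz. The
report does not describe how its comparison program works, so the pairing of
marks, the conversion from mark spacing to frequency, and all function names
are assumptions.

```python
import math


def pitch_tolerance_hz(pitch_hz, fraction=0.02):
    """Acceptable rms pitch error: 2 Hz at 100 Hz, scaling with pitch."""
    return fraction * pitch_hz


def marks_to_pitch_hz(marks_s):
    """Convert pitch-mark times (seconds) to per-period pitch frequencies."""
    return [1.0 / (b - a) for a, b in zip(marks_s, marks_s[1:])]


def rms_pitch_error_hz(reference_marks_s, test_marks_s):
    """Rms difference in pitch between two equally long sets of marks.

    Assumes the i-th test mark corresponds to the i-th reference mark.
    """
    ref = marks_to_pitch_hz(reference_marks_s)
    test = marks_to_pitch_hz(test_marks_s)
    if not ref or len(ref) != len(test):
        raise ValueError("mark sets must have equal length and >= 2 marks")
    total = sum((t - r) ** 2 for r, t in zip(ref, test))
    return math.sqrt(total / len(ref))


# Hypothetical hand-marked reference at a steady 100-Hz pitch.
reference = [0.00, 0.01, 0.02, 0.03]
error_hz = rms_pitch_error_hz(reference, reference)
acceptable = error_hz <= pitch_tolerance_hz(100.0)
```

A candidate extractor would pass this check at a given pitch if its rms
error stayed at or below `pitch_tolerance_hz(pitch)`.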

Use of an excitation function that consists of a mixture of pulses
and random noise produces very high quality synthetic speech. No quality
degradation was found with this concept when the proper combination rule
was used. In fact, the mixture concept seemed to offer an unexpected
degree of robustness with respect to a variety of system degradations.
For example, the use of the noise mixture concept, rather than a hard
buzz-hiss decision, improved the quality of the synthetic speech with
pitch-asynchronous synthesis. Furthermore, the mixture concept is clearly
better suited to handling signals such as the voiced fricatives. The
major question is whether the improvement is worth the effort of
transmitting two or three extra bits each analysis block to convey this
information. For the first systems developed, it is clearly an unnecessary
luxury. However, future systems may find this structure desirable.

The major contribution of our research has been the development of
an adaptive data compression algorithm for the linear predictive
coefficients. The algorithm (known as DELCO) recognizes steady-state segments
of speech and transmits new LPC parameters only when there are
significant changes in the parameter values from the previously transmitted
values. Thus, an adaptive sampling system is used between the LPC
analysis system and the transmission system. The DELCO algorithm is preferable
to a fixed, low-rate LPC analysis system, since DELCO can respond to
rapid changes in signal structure when necessary. By contrast, the fixed,
lower-rate LPC system (with the same average transmitted bit rate) will
miss or will not accurately represent these rapid changes.

The result of this data compression is a reduction in the required
average data rate by a factor in excess of two, with no discernible
quality loss. The exact compression factor depends on the speaker and the
utterance. Frequently, the compression is significantly greater than two
to one. DELCO produces a nonuniform data rate since it is based on
adaptive sampling of a fixed-rate system. Data compression systems that
produce nonuniform data rates require rate-smoothing buffers to interface
with synchronous communication systems. However, DELCO can be interfaced
with an asynchronous communication system, such as a packet-switching
transmission system, without requiring rate-smoothing buffers. Thus,
DELCO is ideally suited for operation with packet-switching systems.

In summary, the major contribution of our research has been the
development of the adaptive data compression algorithm DELCO. DELCO
reduces the average data rate of an LPC vocoder by a factor of two or more
while maintaining excellent speech quality. DELCO is a proved concept
that can be readily interfaced with packet-switching systems and other
asynchronous communication systems.
Appendix A


ADAPTIVE  SPEECH  COMPRESSION  FOR  PACKET  COMMUNICATION  SYSTEMS* 

D. T. Magill

Stanford  Research  Institute 
Menlo  Park,  California  94025 

Packet communication systems offer many significant advantages for
low duty factor user populations. These advantages can be applied to
voice communication. Additional data compression beyond that achievable
with the new linear predictive encoding techniques can be obtained by
exploiting the asynchronous character of the packet communication channel.
The adaptive data compression algorithm DELCO achieves a compression
factor greater than two while maintaining high quality.

I INTRODUCTION

The conventional approach to joint utilization of a common
communication resource among multiple users is frequency-division multiple
access (FDMA).† Each user is assigned a separate frequency channel (and
in some cases, such as satellite communication, a fraction of the
available power) on a dedicated basis. This traditional approach is efficient
for static user populations and has been used with great success for
analog communication systems.


* This work was supported by the Advanced Research Projects Agency of
the Department of Defense (DAHC04-72-C-0009).

† In this appendix the term multiple access is used generally and
includes, as a special case, multiplexing, i.e., the case when all
users are, effectively, collocated.
The advent of the digital computer and its associated digital
technology has had a tremendous impact on both the concepts and the hardware
of communication systems. In particular, digital modulation has rapidly
grown in prominence due to theoretical and hardware advantages. Digital
signaling has been employed successfully with the conventional FDMA
approach. However, it has been recognized that time-division multiple
access (TDMA) offers significant advantages. With TDMA the communication
resource, a single wideband channel, is shared on a time-division basis.
Thus, for example, the problem of frequency stability for many
narrowband channels is greatly alleviated. In many parts of the TDMA
communication system, a single piece of time-shared equipment replaces multiple
units in the conventional FDMA system. This is possible due to the
inherent high speed of present day digital circuits. There are other
advantages of digital TDMA systems, which are not listed in the interest
of brevity. The important point is that, so far, the discussion refers
to a synchronous TDM or TDMA system with a relatively static user
population, each user receiving a dedicated link. Such a system might be
reconfigured relatively infrequently, perhaps once a day or once a month.

In practice there are many communication environments in which the
user population possesses far different characteristics. For example, a
communication system may consist of very many remote data terminals
accessing a central computer. In this case, these data terminals might
have a very low duty factor and have independent statistics. These
messages might be very short in duration and occur randomly. For such a
system the conventional FDMA or the relatively recent digital TDMA system
might be quite inefficient. The basic problem is that these systems have
been designed on the basis of the dedicated circuit concept. This
concept simply is not suited to a very large user population that has a very
low duty factor. For example, there simply may not be enough bandwidth
to allocate each of the many system users a dedicated circuit. Even if
it were possible, inefficient use of the communication resource would
result.

One efficient method of operating with such a user population is
known as packet communication.1,2,3* With this system all users share
a common wideband channel in a random, asynchronous mode. Each user
transmits its information in packets or short bursts. These packets
consist of the data plus preamble bits that carry source and sink
(destination) information. Parity bits are also attached for error detection
and correction.

Many forms and variations of packet communication are possible.
However, the following example suffices to illustrate the major concepts.
If the message is received correctly at the intended destination, i.e.,
no parity errors are detected, then an appropriate acknowledgment is
transmitted back to the sender. If the acknowledgment is correctly
received, then the message is removed from the sender's buffer storage and
the sender is ready to progress to its next message. However, if an
incorrect message is received at the destination, a repeat request is
generated. When this is correctly received at the source the original
message is repeated and the process continues as described above. Most
often, the necessity for a repeat transmission is generated by the
simultaneous transmission of two or more messages from random sources.
However, receiver noise may occasionally cause such a repeat request.

Clearly, as the system usage factor becomes higher, more frequent
repeat requests will become necessary. Thus, the effective system usage
will increase, resulting in further repeat requests. Such a system has
a snowballing effect if the system usage becomes excessively high.
With a well-designed system the usage factor can be kept appropriately
low and this problem avoided. The net result is that for a sufficiently
large and low-duty-factor user population, packet communication can offer
significant advantages over the conventional dedicated circuit approach.
Furthermore, since packet communication can be regarded as a form of
asynchronous TDMA, it possesses most of the advantages of TDMA with
respect to FDMA. Consequently, packet communication offers many important
advantages. A very significant characteristic of such systems is their
asynchronous, random signal flow.

* References are listed at the end of this appendix.

To date, the advantages of packet communication have been described
with respect to data systems. However, voice communication systems
frequently have user populations with similar characteristics, e.g., low
duty factor. Thus, voice communication systems need to be considered
from the packet communication system viewpoint. Furthermore, in many
cases, it is desirable to mix voice and data within a common system. In
addition, the security advantages of digitized speech are well recognized.
Consequently, the performance and capabilities of digitized voice in
packet communication systems were investigated.

II SPEECH COMPRESSION

It is well known that digital transmission of speech is a difficult
problem with many trade-offs between data rate, system complexity, and
voice quality. Simple systems such as delta modulation (and its numerous
variations) offer high quality,* i.e., input and output virtually
indistinguishable, only for high data rates. Complex systems such as vocoders
operate at modest signaling rates, i.e., 2400 to 9600 baud, but are prone
to providing inconsistent quality. That is, while high intelligibility
may be maintained, loss of speaker identification, emotional content,
and naturalness may result under certain circumstances. Thus, with
conventional approaches it does not appear possible to obtain the desired
high quality with a data rate that can readily be transmitted through a
4-kHz phone circuit.

* By high quality we refer to the quality obtainable in a standard 4-kHz
phone channel.

At present the most promising new technique for speech digitization
is based on linear predictive encoding.4-7 With linear predictive
encoding, short-term properties of the speech process S(t) are deduced by
posing the linear one-step prediction problem. That is, it is desired
to select a set of p coefficients $\{a_i\}$ such that the error

$$
E(t) = S(t) - \sum_{i=1}^{p} a_i S(t - i)
$$

is minimized in a mean-square error sense over some interval. While
there are several formulations of the problem, it is convenient to
choose the following example. The mean-square error is minimized over
a finite block size of 100 samples or a pitch period, depending on
whether the speech signal is voiced or unvoiced.*

* Here we assume a sampling rate of approximately 10 kHz so that an
analysis block of 100 samples corresponds to a 100-Hz refresh rate on
the analysis. This appears to be sufficiently often to track the
changes in the vocal tract configuration.

Posing the above problem leads to a set of p simultaneous equations
in the autocorrelation coefficients and the unknown linear predictive
coefficients (LPC), i.e., the $\{a_i\}$, which may be solved for the latter.
Figure A-1 illustrates these equations in matrix form. The LPC partially
characterize the speech process on a short-term basis and, in fact, can
be readily related to the short-term power spectral density. This power
spectral density is known to be an adequate characterization of the
speech process when used in conjunction with other important parameters
such as the voiced-unvoiced (V/UV) decision, the pitch, and the overall
signal power.

FIGURE A-1 MATRIX FORMULATION OF LPC EQUATIONS
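As an illustration of how the p simultaneous equations can be solved, the
sketch below uses the Levinson-Durbin recursion on the autocorrelation
(Toeplitz) form of the equations. The appendix does not say which solution
method was used, so the choice of recursion, the function names, and the
sign convention for the reflection coefficients are assumptions.

```python
def autocorrelation(samples, order):
    """Short-term autocorrelation r[0..order] of one analysis block."""
    n = len(samples)
    return [
        sum(samples[t] * samples[t - lag] for t in range(lag, n))
        for lag in range(order + 1)
    ]


def levinson_durbin(r, order):
    """Solve sum_l a_l * r[|i - l|] = r[i] for i = 1..order.

    Returns (a, k, err): predictor coefficients a_1..a_p, reflection
    coefficients k_1..k_p, and the minimum residual (prediction error) energy.
    """
    a = [0.0] * (order + 1)  # a[0] is unused padding so indices match the text
    k = []
    err = r[0]
    for m in range(1, order + 1):
        if err <= 0.0:
            raise ValueError("residual energy vanished; cannot continue")
        acc = r[m] - sum(a[i] * r[m - i] for i in range(1, m))
        k_m = acc / err
        prev = a[:]
        a[m] = k_m
        for i in range(1, m):
            a[i] = prev[i] - k_m * prev[m - i]
        k.append(k_m)
        err *= 1.0 - k_m * k_m
    return a[1:], k, err


# Autocorrelation of an idealized first-order process with coefficient 0.5.
coeffs, reflection, residual = levinson_durbin([1.0, 0.5, 0.25], 2)
```

For this idealized input the recursion recovers a single nonzero predictor
coefficient, as expected for a first-order process.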

If the LPC (or a suitably transformed version of them) and the V/UV,
pitch, and power parameters are encoded and transmitted, the receiver
can synthesize a signal that accurately models the input speech
short-term power spectral density.7 In this case satisfactory quality will be
obtained. The LPC parameters are used in a recursive (all-pole) filter
that is excited by an appropriate source. For unvoiced segments an
independent noise generator is used. For voiced segments an impulse generator
(the frequency is controlled by the pitch parameter) is employed. In
both cases the excitation level is controlled by the power parameter.
Figures A-2 and A-3 are block diagrams of the transmitter and receiver,
respectively.
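The receiver described above can be sketched as a source generator feeding
an all-pole recursion. This is a minimal illustration under assumptions of
our own: the excitation is either a unit impulse train or Gaussian noise,
the power parameter is applied as a simple gain, and the function names are
hypothetical.

```python
import random


def excitation(n, voiced, pitch_period, rng):
    """Impulse train (voiced) or independent noise (unvoiced), n samples."""
    if voiced:
        return [1.0 if t % pitch_period == 0 else 0.0 for t in range(n)]
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def synthesize(a, source, gain, state=None):
    """All-pole filter: y(t) = gain * e(t) + sum_i a_i * y(t - i).

    state holds the previous outputs [y(t-1), ..., y(t-p)] so that successive
    blocks can be filtered without a discontinuity.
    """
    p = len(a)
    hist = list(state) if state is not None else [0.0] * p
    out = []
    for e in source:
        y = gain * e + sum(a[i] * hist[i] for i in range(p))
        out.append(y)
        hist = ([y] + hist)[:p]
    return out, hist


rng = random.Random(0)
voiced_source = excitation(6, True, 3, rng)
block, state = synthesize([0.5], [1.0, 0.0, 0.0, 0.0], gain=2.0)
```

Passing the returned `state` into the next call lets the synthesizer keep
its filter memory across analysis blocks, whether or not new coefficients
arrive.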
FIGURE A-3 BLOCK DIAGRAM OF LPC SYNTHESIZER
At this point one can note the obvious similarities with the
conventional channel vocoder approach. It is reasonable to ask what advantages
the LPC approach offers over the conventional approach. Basically the
higher quality of the former approach can be related to the greater
flexibility of the recursive synthesizing filter as compared with the relatively
fixed capabilities of the channel vocoder synthesizing filter. In
addition, the LPC technique is directly suitable for computer processing and
digital implementation. Note that poor quality in the synthesized speech
due to errors in the V/UV decision and pitch extraction is not avoided by
adopting the LPC approach. Since the LPC approach has proved more
successful (on the basis of preliminary research) than any other speech
compression technique, it has been investigated for application with packet
communication systems.

III DELCO ALGORITHM

To date all LPC algorithms and systems have been designed for
operation with a synchronous, dedicated circuit. Thus, both active speech and
speech pauses are transmitted. Since typical conversational speech has
a duty factor of less than 50 percent, it should be possible to reduce
the bit rate of a typical LPC speech compression digitizer by a factor of
two. With a normal synchronous communication system this would result
in a buffering problem since the achievable compression is a nonuniform
function of time. Fortunately, with packet communication, an asynchronous
or burst-type transmission is acceptable and no rate-smoothing buffer is
required.

The compression algorithm that was developed modifies the basic LPC
algorithms to permit adaptive operation appropriate to the input speech.
Data compression beyond that obtainable by the LPC algorithms is obtained
in two ways. First, pauses in speech are eliminated by a TASI-type speech
detector that determines the presence or absence of a speech signal.8,9
Second, steady-state portions of speech are recognized and only the new
information is encoded. The synthesizer maintains the previous parameter
values unless new values are transmitted. Consequently, the proposed
scheme transmits no unnecessary speech information.

The necessity of transmitting new LPC parameters is established by
considering the energy in the one-step prediction error or residual signal.
This error energy is determined assuming that the last transmitted LPC
parameter vector is used to form the prediction. Rather than computing
the residual energy in the obvious but lengthy fashion, one can use the
formula

$$
E^{(k)}\bigl(a^{(j)}\bigr) = \varphi_{00}^{(k)}
  - 2 \sum_{i=1}^{p} a_i^{(j)} \varphi_{0i}^{(k)}
  + \sum_{i=1}^{p} \sum_{l=1}^{p} a_i^{(j)} a_l^{(j)} \varphi_{il}^{(k)}
\tag{A-1}
$$

where the superscript denotes the analysis block, $\varphi_{il}^{(k)}$ are
the autocorrelation coefficients in the kth (the present) analysis block, and
$a^{(j)}$ are the LPC parameters from the jth (previous) analysis block. This
energy is compared with the residual energy that would be obtained if the
optimized LPC parameters were used to form the predicted signal:

$$
E^{(k)}\bigl(a^{(k)}\bigr) = \varphi_{00}^{(k)}
  - \sum_{i=1}^{p} a_i^{(k)} \varphi_{0i}^{(k)}
\tag{A-2}
$$

If

$$
\mathrm{DEL} = E^{(k)}\bigl(a^{(j)}\bigr) \big/ E^{(k)}\bigl(a^{(k)}\bigr)
\tag{A-3}
$$

is less than a threshold value γ, then parameter vector $a^{(j)}$ is judged to
be sufficiently accurate that it can be used for the present
analysis/synthesis block.
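The threshold test can be sketched as follows. The code evaluates the
residual energy as a quadratic form in the autocorrelation coefficients,
assuming the Toeplitz case in which the coefficient for indices i and l
depends only on |i - l|. That expansion, the function names, and the sample
values are assumptions made for illustration and are not taken from the
DELCO program.

```python
def residual_energy(r, a):
    """Residual energy of predictor a on a block with autocorrelation r.

    r[0..p] are the block's autocorrelation coefficients (Toeplitz form);
    a[0..p-1] hold a_1..a_p. Expands sum_t (S(t) - sum_i a_i S(t - i))^2.
    """
    p = len(a)
    cross = sum(a[i] * r[i + 1] for i in range(p))
    quad = sum(
        a[i] * a[l] * r[abs(i - l)] for i in range(p) for l in range(p)
    )
    return r[0] - 2.0 * cross + quad


def delco_decision(r_now, a_last_sent, a_now, gamma):
    """Return (transmit, ratio) for the present analysis block.

    ratio is the residual energy obtained with the last transmitted
    parameters divided by that obtained with the present optimum parameters.
    New parameters are sent only when the ratio reaches the threshold.
    """
    ratio = residual_energy(r_now, a_last_sent) / residual_energy(r_now, a_now)
    return ratio >= gamma, ratio


# Hypothetical block: present optimum predictor is [0.5, 0.0]; the last
# transmitted predictor was all zeros.
r_block = [1.0, 0.5, 0.25]
send_at_1_40, ratio = delco_decision(r_block, [0.0, 0.0], [0.5, 0.0], 1.40)
send_at_1_15, _ = delco_decision(r_block, [0.0, 0.0], [0.5, 0.0], 1.15)
```

In this example the old parameters raise the residual energy by one third,
so they would be retained at a threshold of 1.40 but replaced at 1.15.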

Experience with the DELCO algorithm indicates that a threshold value
of a 40 percent increase, i.e., γ = 1.4, yields a compression factor of
approximately two without producing noticeable degradation. Thresholds
as high as γ = 2, i.e., a 100 percent increase in residual energy, have
been employed, yielding compression factors of approximately five. While
the resulting speech is intelligible, it is noticeably distorted,
primarily with an echo effect. Consequently, a conservative estimate of
the compression factor (while maintaining high quality) is two to one.
Table A-1 presents the simulation results for speaker Number 2 with the
sentence, "Pete Cooper's dog toyed with Dick Todd's cat."


The results of Table A-1 are based on this sentence, which has only
very short pauses, and on the DELCO algorithm without using the TASI-type
signal presence detector. With the speech detector installed in the
voice digitizer, it should be possible to obtain an overall compression
factor of four to one or better, since a user's average duty factor is
less than 50 percent.
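
The arithmetic behind these figures is simple enough to check directly. The sketch below is illustrative only; the function names are hypothetical, and it assumes that the DELCO gain and the duty-factor gain combine multiplicatively.

```python
def compression_factor(total_blocks, transmitted_blocks):
    """Ratio of analysis blocks produced to blocks actually transmitted."""
    return total_blocks / transmitted_blocks


def overall_compression(delco_factor, duty_factor):
    """Combine the DELCO gain with silence suppression.

    Assumes nothing is sent while the talker is silent, so a duty factor
    of 0.5 doubles the DELCO compression factor.
    """
    return delco_factor / duty_factor


# Block counts taken from Table A-1 (threshold -> blocks transmitted).
TABLE_A1 = {1.0: 288, 1.15: 210, 1.40: 126, 2.0: 62}
factors = {g: compression_factor(288, n) for g, n in TABLE_A1.items()}
```

With these counts the ratios come out near 1.37, 2.28, and 4.65, matching the tabulated compression factors, and `overall_compression(2.0, 0.5)` gives the four-to-one figure quoted above.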


                            Table A-1

                    DELCO COMPRESSION FACTOR

Threshold      Number of Blocks        Compression    Quality
Value          Transmitted Out of      Factor
($\gamma$)     288 Analysis Blocks

1.0            288                     1.0            High
1.15           210                     1.37           High
1.40           126                     2.28           High
2.0             62                     4.65           Distorted with
                                                      an echo effect

Atal has demonstrated high quality speech at transmission rates in
the range of 2400 to 9600 baud.* Thus, one might expect that the DELCO
algorithm with packet communication might yield data rates as low as 600
to 2400 baud. Such is not the case for several reasons. First, the
lowest rate of 2400 baud is achieved by using the low frame rate of 33-1/3
Hz rather than the 100 Hz previously described. With such long analysis
blocks, it is less likely that the subsequent analysis blocks will pass
the DELCO threshold test than when shorter blocks are used. Second, the
packet communication system concept has overhead bits associated with it,
and these will increase the average baud rate needed to convey a speech
channel. At this point it is desirable to consider further this expansion
factor.
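
A rough way to see how frame rate, overhead, and compression interact is sketched below. It is a back-of-the-envelope model, not a result from the study: the function name is hypothetical, one analysis block per packet is assumed, and the 72 information bits and roughly 60 overhead bits are the figures quoted in Section IV.

```python
from fractions import Fraction


def channel_rate(frame_rate_hz, info_bits, overhead_bits=0, compression=1):
    """Average bit rate in bits per second.

    Assumes one analysis block per packet and that only 1/compression
    of the blocks are actually transmitted.
    """
    bits_per_packet = info_bits + overhead_bits
    return Fraction(frame_rate_hz) * bits_per_packet / Fraction(compression)


# 72 bits per block at 33-1/3 blocks per second, no overhead, no compression
low_rate = channel_rate(Fraction(100, 3), 72)
# Expansion factor caused by 60 overhead bits on a 72-bit block
expansion = Fraction(72 + 60, 72)
```

Here `low_rate` is exactly 2400 bits per second, consistent with the lowest rate cited above, and the expansion factor for 60 overhead bits is 11/6, or about 1.83.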

IV  VOICE  PACKET  FORMAT 

Each packet must convey appropriate routing information such as
source and destination identification. Since these bits are a type of
fixed overhead, it is desirable to make each packet as large as possible
to minimize the inefficiency due to the overhead bits. However, an
increased packet length increases the average propagation delay through
the network.

The minimum cycle time for the sink to acknowledge to the source
that the packet was properly received is

$$T_{\text{cycle}} = 2T_p + T_m + T_a + T_r \tag{A-4}$$

where $T_p$ is the physical propagation path delay, $T_m$ is the message or
packet duration, $T_a$ is the duration of the acknowledgment message, and
$T_r$ is the processing delay in the receiver. If the message or the
acknowledgment is incorrectly received, then it is necessary to repeat the
cycle. In this case the network propagation delay is significantly
increased.
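
The cycle-time relation, and the effect of repeated cycles, can be illustrated as follows. This is a minimal sketch with hypothetical function names; the expected-delay formula assumes that each cycle fails independently with the same probability, which the text does not state.

```python
def cycle_time(t_p, t_m, t_a, t_r):
    """Minimum acknowledgment cycle time in seconds.

    t_p: one-way physical propagation delay
    t_m: message (packet) duration
    t_a: acknowledgment duration
    t_r: processing delay in the receiver
    """
    return 2.0 * t_p + t_m + t_a + t_r


def expected_delivery_time(t_cycle, p_repeat):
    """Mean time until a cycle succeeds.

    Assumes independent failures with probability p_repeat per cycle,
    so the number of cycles is geometric with mean 1 / (1 - p_repeat).
    """
    if not 0.0 <= p_repeat < 1.0:
        raise ValueError("p_repeat must be in [0, 1)")
    return t_cycle / (1.0 - p_repeat)
```

With illustrative values of 10 ms propagation, a 50 ms packet, a 5 ms acknowledgment, and 2 ms of processing, the cycle time is 77 ms; a 50 percent repeat probability would double the mean delay.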


Use of excessively long packets can result in network propagation
delays that are unacceptable. Nominally, it is desirable to maintain
the network delay below 0.3 s to avoid conversation difficulties, such
as simultaneous speech. However, it has been reported that users can
tolerate delays as large as 1.2 s.¹⁰


In addition to the maximum tolerable delay effect, which limits
packet sizes, there is a random variation in the propagation delay. The
magnitude of this effect depends on the variables of the cycle time
equation and on the system usage factor, i.e., the likelihood of cycle
repeats. For most reasonably designed systems the variation in the
network delay will significantly distort the time base. As a result it
is necessary to append additional overhead bits that identify the proper
time placement for the information bits describing the speech process.
Many formats are possible, but it is clearly advantageous to use relative
timing information rather than absolute values, since the former
procedure results in a significant data rate reduction.


At present a variable packet structure is envisioned. The data are
arranged in the following sequence: (1) destination, (2) source,
(3) parity, (4) power level and signal presence/absence, (5) voiced/
unvoiced ratio, (6) pitch, (7) LPC parameters, and (8) relative time.
If no signal is present, the packet could be truncated after the fourth
position. Otherwise, the full duration packet would be transmitted.
The synthesizer at the receiver continues to employ the previous values
until it is signaled to change to new values. Nominally one might expect
some 60 or so overhead bits for source, destination, and parity bits.
Thus, if the speech information requires 60 or more bits, the packet
efficiency should be at least 50 percent. Atal has shown that 72 bits per
analysis block are adequate to provide high quality synthetic speech.⁵
Thus, so long as the packet describes one or more analysis blocks, the
packet efficiency should exceed 50 percent.
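
The efficiency estimate can be written out explicitly. In this sketch the function name is hypothetical, efficiency is taken to mean information bits divided by total bits, and the default of 60 overhead bits is the nominal figure given above.

```python
def packet_efficiency(info_bits_per_block, overhead_bits=60, blocks_per_packet=1):
    """Fraction of the packet that carries speech information."""
    payload = info_bits_per_block * blocks_per_packet
    return payload / (payload + overhead_bits)
```

With 72 information bits and 60 overhead bits, one block per packet gives 72/132, or about 55 percent; packing two blocks into one packet raises this to 144/204, or about 71 percent, at the cost of a longer packet.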

The above arguments neglect the loss due to the necessity of
transmitting timing information. The number of bits required depends on
the range of the relative time measurement and the required resolution.
The range can be reduced by periodically transmitting fixed time
references even when it is unnecessary to transmit new speech
coefficients. A time resolution of 10 ms should be adequate for the
speech process parameters. As a result, ten bits should be more than
sufficient for timing information. Thus, the requirement for timing bits
does not significantly affect the packet efficiency.
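
The relation between range, resolution, and the number of timing bits can be checked with integer arithmetic. The helper names below are hypothetical, and a plain binary count of resolution steps is assumed as the timing format.

```python
def timing_range_ms(bits, resolution_ms=10):
    """Largest relative-time span representable with the given bit count."""
    return (2 ** bits) * resolution_ms


def timing_bits(range_ms, resolution_ms=10):
    """Bits needed to count resolution steps across range_ms."""
    steps = -(-range_ms // resolution_ms)  # ceiling division
    return max(1, (steps - 1).bit_length())
```

Ten bits at 10 ms resolution span 10.24 s, far more than the 1.2 s delay mentioned earlier, which would need only 7 bits. Adding ten timing bits to the 72 information and 60 overhead bits changes the efficiency from 72/132 to 72/142, which is still just above 50 percent.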

V SIMULATION 

The existing system used to generate the demonstration tape has been
implemented on a large, general purpose, time-shared computer, a PDP-10
that is part of the ARPANET. Input/output and display are handled
through an auxiliary PDP-15 computer that permits interactive operation.
The analysis can be performed either on a Toeplitz or non-Toeplitz basis.⁶⁻⁸
The synthesizing filter can be either of the direct or ladder forms.¹¹,⁷
The latter is preferred from a coefficient accuracy point of view. At
present, with no significant effort on algorithm speed, the program runs
about 60 times slower than real time.

The simulated performance is based on pitch-synchronous analysis
using hand-placed pitch pulses. This was done as the initial stage since
the major effort of this study was to explore the interaction between the
LPC approach and the packet communication system rather than to develop
pitch extractors. The excitation function driving the synthesizing
filter uses these pitch pulses for pitch-synchronous synthesis. The
excitation power is divided between random noise and pulse power,
depending on the energy in the residual signal normalized by the signal
power. If the normalized residual energy exceeds a threshold, all of the
excitation energy is noise-like. Otherwise the ratio of noise power to
total power is a quadratic function of the normalized residual energy.
The threshold value has been selected on the basis of providing high
quality synthesis for the speech data base.
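
One possible shape for this ratio excitation is sketched below. The text states only that the noise-to-total ratio is quadratic below the threshold and equal to one above it. The particular quadratic used here, the square of the normalized residual energy divided by the threshold, is an assumption chosen so that the ratio is continuous at the threshold, and the function names are hypothetical.

```python
def noise_fraction(norm_residual, threshold):
    """Fraction of excitation power assigned to random noise.

    ASSUMPTION: the quadratic is (norm_residual / threshold) ** 2;
    the report does not give the actual coefficients.
    """
    if norm_residual > threshold:
        return 1.0
    return (norm_residual / threshold) ** 2


def split_excitation(total_power, norm_residual, threshold):
    """Return (noise_power, pulse_power) for one synthesis epoch."""
    f = noise_fraction(norm_residual, threshold)
    return f * total_power, (1.0 - f) * total_power
```

Under this assumed form, a normalized residual energy at half the threshold puts one quarter of the excitation power into noise and the rest into pitch pulses.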

Typically, the existing DELCO algorithm has been run with a
non-windowed Toeplitz analysis, a ladder synthesizer, 14 coefficients,
and a pitch-synchronous analysis/synthesis structure. However, many other
modes are possible. To date the data compression algorithm has been
applied to the excitation energy, the V/UV ratio, and the LPC parameters.

No attempt has been made to adaptively encode the pitch parameters.
There are several reasons for this. First, the data rate required to
transmit pitch information is only about one-tenth of the rate required
to characterize the complete speech process. Thus, the requirement to
continually update pitch is not burdensome. Second, the quality of
synthetic speech is critically dependent on the pitch signal. Thus, it is
important to accurately transmit pitch information. Third, the normalized
residual energy is not a good measure of changes in pitch. However, in
the future it may be desirable to develop a good method for compressing
the pitch information; DPCM is perhaps such a method. At present
this problem area has been reserved for future study.

VI  CONCLUSIONS 

The adaptive data compression technique DELCO works very well,
yielding significant data compression without degrading voice quality.
It is estimated that data rates of about 1200 to 4800 baud permit high
quality voice transmission with packet communication systems. Such
systems avoid the wasteful practice of dedicating circuits to low
duty-factor users. Thus, based on this initial research effort, the
concept of a voice packet communication system appears very promising.
Much work remains to develop the full capabilities of such a system.
Appendix B


DESCRIPTION OF SUBROUTINE EPOCH


The subroutine EPOCH sets up the analysis and synthesis intervals
based on a set of (pitch) marks. These marks can be the output of a
pitch extractor (of the absolute-time type) or the result of human
pitch-mark placement. The linear predictive analyses and syntheses are
then performed over these epochs. Figure B-1 is a listing of the
subroutine EPOCH.
      SUBROUTINE EPOCH (NPTMX, MOVPTS, IMHER, MARK, NMK, IWSW,
     X   KAN, NPTAN, KSYN, NPTSYN)
      DIMENSION MARK(NMK)
C     THIS ROUTINE SETS UP EPOCHS FOR ANALYSIS AND SYNTHESIS.
C
C     INPUTS
C       NPTMX   - MAXIMUM LENGTH OF EPOCH FOR BLOCK ANALYSIS
C                 AND FOR PITCH SYNCH IN UNVOICED INTERVALS.
C       MOVPTS  - PTS TO MOVE ANALYSIS EPOCH FOR "OVER"
C                 ALSO LENGTH OF SYNTHESIS EPOCH.
C       IMHER   - ABSOLUTE TIME INDEX FOR PITCH MARKS.
C       MARK    - ARRAY OF PITCH MARKS, STORED AS ABSOLUTE TIME
C                 INDICES OF THE DATA ARRAY (1 SAMPLE COUNT).
C       NMK     - NO. OF MARKS.
C       IWSW    - ALPHA SWITCH THAT SELECTS THE EPOCH OPTION.
C         "NO"    - BLOCK ANALYSIS WITH LENGTH NPTMX.
C         "OVER"  - BLOCK ANALYSIS WITH LENGTH NPTMX
C                   OVERLAPPED BY MOVPTS.
C                   SYNTHESIS EPOCH IS MOVPTS.
C         "PTSYN" - PITCH SYNCHRONOUS ANALYSIS & SYNTHESIS.
C                   ANALYSIS & SYNTHESIS EPOCHS THE SAME.
C                   EPOCH = DISTANCE BETWEEN 2 MARKS,
C                   UNLESS DISTANCE > NPTMX
C                   THEN:
C                   IF DISTANCE > 2*NPTSMX, EPOCH = NPTSMX
C                   IF DISTANCE < 2*NPTSMX, EPOCH = DISTANCE/2.
C         "PTOVR" - PITCH SYNCH ANALYSIS & SYNTHESIS OVERLAPPED.
C                   SAME RULES FOR SYNTHESIS EPOCH AS "PTSYN".
C                   ANALYSIS EPOCH IS 3 OVERLAPPING PERIODS DURING
C                   VOICED PORTIONS AND THE SAME FOR UNVOICED.
C                   ONLY 2 PERIODS ARE USED AT BEGINNING OF VOICED
C                   INTERVAL AND AT END.
C         "SLO"   - BLOCK SYNCHRONOUS ANALYSIS WITH OVERLAPPING.
C                   SYNTHESIZER COEFFICIENTS ARE CHANGED WITH PITCH
C
C     OUTPUTS
C       KAN     - RELATIVE INDEX WHERE ANALYSIS EPOCH STARTS.
C       NPTAN   - NO. OF POINTS IN ANALYSIS EPOCH.
C       KSYN    - RELATIVE INDEX WHERE SYNTHESIS EPOCH STARTS.
C       NPTSYN  - NO. OF POINTS IN SYNTHESIS EPOCH.
C
      LOGICAL OLDMRK, SWITCH, PT


FIGURE  B-1  LISTING  OF  SUBROUTINE  EPOCH 


C     KSYN < 0 MEANS INITIALIZE
      IF (KSYN.GE.0) GO TO 100
      NEXT = 3
      PT = 'PT' .AND. "777760000000
      SWITCH = IWSW .AND. "777760000000
      ASSIGN 999 TO IT
      IF (SWITCH.EQ.PT)
      NPTAN = NPTMX
      NPTSYN = NPTMX
      IZERO = IMHER
      KAN = -NPTSYN
      KSYN = -NPTSYN
C     **********
      NADD = 0
      IF (.NOT.(IWSW.EQ.'OVER'.OR.IWSW.EQ.'SLO')) GO TO 100
C     OVERLAPPING BLOCK SYNCH. ANALYSIS
      NPTSYN = MOVPTS
      KAN = (NPTSYN-NPTAN)/2 - NPTSYN
      KSYN = -NPTSYN
C
C     BLOCK SYNCH. ANALYSIS  ***********
  100 KAN = KAN + NPTSYN
      KSYN = KSYN + NPTSYN
      IF (IWSW.NE.'SLO') GO TO IT
C
C     MODIFIED OVERLAPPING BLOCK SYNCH. ANALYSIS  ********
C     COEFFICIENTS SWITCHED WHEN A PULSE OCCURS.
      KAN = KAN - NADD
C     FIND NEXT 2 MARKS
      NDIS1 = 0
      NDIS2 = 0
  430 IF (NEXT.GT.NMK) GO TO 490
      NDIS2 = MARK(NEXT) - IMHER - 1
      IF (NDIS2.GE.MOVPTS-NADD) GO TO 435
      NDIS1 = NDIS2
      NEXT = NEXT + 1
      GO TO 430
  435 IF (NDIS1.LE.0.AND.NDIS2.LE.0) GO TO 490
      IF (NDIS1.LE.0) NDIS1 = -30000
      IF (NDIS2.LE.0) NDIS2 = 30000
C     PICK THE CLOSEST MARK
      NDIS = NDIS2
      IF (NDIS2+NADD-MOVPTS.LT.MOVPTS-NDIS1-NADD) GO TO 437
      NDIS = NDIS1
      NEXT = NEXT-1
C     BR IF ROOM FOR MORE THAN ONE EPOCH
  437 IF (NDIS.GT.1.5*MOVPTS-NADD) GO TO 490
      NPTSYN = NDIS
      NADD = NADD + NPTSYN - MOVPTS
      GO TO 999
  490 NPTSYN = MOVPTS - NADD
      NADD = 0
C
C     PITCH SYNCH. ANALYSIS  ***********
      OLDMRK = .TRUE.
  310 NDIS = MARK(NEXT) - IMHER
      NPTSYN = MIN0(NDIS, NPTMX)
      IF (NPTSYN) 320, 330, 340
C     NPTSYN LT 0 MEANS -1, IGNORE IT
  320 IF (NEXT.LT.NMK) GO TO 330
      NPTSYN = NPTMX
      NPTAN = NPTSYN
      GO TO 999
C     NPTSYN = 0 MEANS WE NEED A NEW MARK
  330 NEXT = NEXT + 1
      OLDMRK = .FALSE.
      GO TO 310
C     NPTSYN GT 0 MEANS WE GOT A GOODIE
  340 NPTAN = NPTSYN
      KAN = KSYN
C     IS THERE ENUF FOR TWO FULL BLOCKS?
      NPTSYN = NDIS/2
      NPTAN = NPTSYN
      GO TO 999



      IF (IWSW.NE.'PTOVR'.OR.OLDMRK) GO TO 999
C     OVERLAPPING PITCH SYNCH. ANALYSIS  ***********
C
C     OBJECTIVE: TO USE PRECEDING AND FOLLOWING
C     PITCH PERIODS IN ADDITION TO PRESENT ONE IN ANALYSIS
C     EPOCH DURING VOICED (PITCH MARKED) INTERVALS.
C
C     GENERAL CASE
      KAN = MARK(NEXT-2) - IZERO
      NPTAN = MARK(NEXT+1) - IZERO - KAN
      IF (KSYN-KAN.LT.NPTMX) GO TO 362
C     FIRST PERIOD
      KAN = KSYN
      NPTAN = MARK(NEXT+1) - IZERO - KAN
      GO TO 999
  362 IF (MARK(NEXT+1)-MARK(NEXT).LT.NPTMX.AND.NEXT.LT.NMK) GO TO 999
C     LAST PERIOD
      GO TO 999
C
  999 CONTINUE
      RETURN
      END

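
The pitch-synchronous ("PTSYN") epoch rule stated in the header comments of the listing can be restated compactly. The Python sketch below is not a translation of the subroutine: the function names are hypothetical, and it assumes the rule is simply reapplied to whatever distance remains until the next mark is reached.

```python
def ptsyn_epoch_length(distance, nptmx):
    """Epoch length for a given distance to the next pitch mark.

    distance <= nptmx      -> the whole distance is one epoch
    distance >  2 * nptmx  -> a full-length epoch of nptmx points
    otherwise              -> half the distance (two roughly equal epochs)
    """
    if distance <= nptmx:
        return distance
    if distance > 2 * nptmx:
        return nptmx
    return distance // 2


def split_interval(distance, nptmx):
    """Epoch lengths covering the whole interval between two marks."""
    lengths = []
    remaining = distance
    while remaining > 0:
        n = ptsyn_epoch_length(remaining, nptmx)
        lengths.append(n)
        remaining -= n
    return lengths
```

For a maximum epoch of 100 points, an interval of 80 points is a single epoch, 150 points is split into two epochs of 75, and 250 points becomes one epoch of 100 followed by two of 75. Halving the last stretch avoids leaving a very short epoch at the end of a long unvoiced interval.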
Appendix C


DESCRIPTION  OF  SIMULATION  TAPE  DEMONSTRATING 
THE  EFFECT  OF  TIMING  ACCURACY  ON  SYNTHETIC  SPEECH  QUALITY 

The  accompanying  tape  is  restricted  to  the  particularly  difficult 
utterance,  "Grasp  the  handle  with  the  hole  in  it,"  by  a male  speaker. 

This  utterance  is  low-pass  filtered  to  a 4-kHz  passband  and  is  sampled 
at  10  kHz.  In  all  cases  14  LPC  parameters,  preemphasis,  a Hamming  window, 
pitch-synchronous  analysis  overlapped  over  three  pitch  periods,  and  ratio 
excitation  (see  Section  V)  were  used. 

Five groupings of three utterances are presented on the tape. In
the first grouping one hears (1) the input (original), (2) the synthetic,
and (3) the input utterances. The synthetic utterance is based on the
best set of hand-marked pitch pulses (file DTG). Note the high quality
of the synthetic speech.

In the second grouping one hears (4) the input, (5) the synthetic
speech (file DTO), and (6) the synthetic speech (file DTG). The synthetic
speech (file DTO) is based on hand-marked pitch pulses on the output of
a formant-isolation filter; less care in iteratively placing the pitch
pulses was taken than for file DTG. Roughly comparable quality is
perceived for both synthetic files.

In the third grouping one hears (7) the input, (8) the synthetic
speech (file DTM), and (9) the synthetic speech (file DTG). File DTM is
created from pitch marks based on the minimum-phase philosophy. Note the
rough quality of file DTM.

In the fourth grouping one hears (10) the input, (11) the synthetic
speech (file DTM/DTG), and (12) the synthetic speech (file DTG). File
DTM/DTG uses inaccurate and accurate pitch marks for analysis and
synthesis, respectively. Note that inaccurate analysis pitch marks have
very little effect on the quality of the synthetic speech.

In the fifth grouping one hears (13) the input, (14) the synthetic
speech (file DTG/DTM), and (15) the synthetic speech (file DTG). File
DTG/DTM uses accurate and inaccurate pitch marks for analysis and
synthesis, respectively. Note the very significant quality loss due to
the use of inaccurate pitch marks for exciting the synthesizing filter.


Based on a comparison between the fourth and fifth groupings, one
can say that accurate excitation pitch marks are much more important
than accurate analysis marks. Furthermore, one can say that an rms
pitch accuracy of 2 Hz (file DTO) provides excellent speech quality. In
addition, it is clear from these recordings that it is possible to
produce outstanding speech quality with the LPC method.